A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity

Orazumbekov, Batyrbek

dc.contributor.advisor	Sandygulova, Anara
dc.contributor.author	Orazumbekov, Batyrbek
dc.date.accessioned	2026-05-29T05:28:52Z
dc.date.issued	2026-05-04
dc.description.abstract	Most sign language recognition systems are built as classification models that map gesture videos to pre-defined glosses. Such systems do not support similarity search, in which users query without knowing the label of a gesture. This thesis proposes a pose-based sign language retrieval system that functions as a reverse gesture dictionary, enabling users to retrieve visually similar gestures directly from video input. The proposed method represents gestures as normalized skeletal joints rather than RGB images, which reduces appearance variation (background, lighting, clothing) and keeps the focus on dynamic motion patterns. The extracted keypoints are temporally normalized and optionally augmented with motion features to better capture gesture dynamics. To model the temporal relationships within the data, two models are considered: a Transformer with a self-attention mechanism and a Spatial-Temporal Graph Convolutional Network (ST-GCN). The two are compared on their ability to model temporal dependencies. The models were evaluated on the WLASL dataset under the few-shot setting using ranking metrics, Recall@K and mean Average Precision (mAP), rather than classification accuracy, because ranking metrics better suit a retrieval task. The experimental results indicate that the Transformer models temporal relationships between frames in gesture sequences better than the graph-based model. Additionally, attention-driven pooling during temporal aggregation improves the results significantly and achieves an mAP of 0.237 on the validation set. Transferability of the embedding space to novel gestures is tested by applying the trained model to the AUTSL dataset (using only a subset of 226 labels). Finally, the impact of approximate nearest neighbor search on retrieval results is examined.
dc.identifier.citation	Orazumbekov, Batyrbek. (2026). A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity. Nazarbayev University School of Engineering and Digital Sciences
dc.identifier.uri	https://nur.nu.edu.kz/handle/123456789/18782
dc.language.iso	en
dc.publisher	Nazarbayev University School of Engineering and Digital Sciences
dc.rights	Attribution 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/us/
dc.subject	sign language recognition
dc.subject	gesture retrieval
dc.subject	pose estimation
dc.subject	Transformer
dc.subject	ST-GCN
dc.subject	few-shot learning
dc.subject	embedding space
dc.subject	approximate nearest neighbor search
dc.subject	WLASL
dc.subject	PQDT_Master
dc.title	A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity
dc.type	Master's thesis
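
Illustrative note on the ranking metrics named in the abstract: the sketch below shows one common way to compute Recall@K and mean Average Precision (mAP) for embedding-based retrieval. It is not taken from the thesis. The cosine-similarity ranking, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration, and the thesis may define or compute these metrics differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_gallery(query, gallery):
    """Gallery indices ordered from most to least similar to the query."""
    return sorted(range(len(gallery)), key=lambda i: -cosine(query, gallery[i]))


def recall_at_k(ranked_labels, query_label, k):
    """1.0 if any of the top-k retrieved items shares the query label."""
    return 1.0 if query_label in ranked_labels[:k] else 0.0


def average_precision(ranked_labels, query_label):
    """Mean of precision values taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evaluate(queries, query_labels, gallery, gallery_labels, k=1):
    """Return (Recall@k, mAP) averaged over all queries."""
    recalls, aps = [], []
    for query, label in zip(queries, query_labels):
        ranked = [gallery_labels[i] for i in rank_gallery(query, gallery)]
        recalls.append(recall_at_k(ranked, label, k))
        aps.append(average_precision(ranked, label))
    return sum(recalls) / len(queries), sum(aps) / len(queries)


# Toy data (hypothetical): 2-D "embeddings" for two glosses.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["hello", "hello", "thanks", "thanks"]
queries = [[1.0, 0.05], [0.6, 0.8], [0.3, 0.9]]
query_labels = ["hello", "thanks", "hello"]  # the third query is a hard case

recall_1, mean_ap = evaluate(queries, query_labels, gallery, gallery_labels, k=1)
print(f"Recall@1 = {recall_1:.3f}, mAP = {mean_ap:.3f}")
```

In this toy setup the third query sits closer to the "thanks" embeddings than to its own gloss, so it lowers both metrics. Unlike classification accuracy, mAP still gives partial credit when the correct items appear further down the ranked list.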

Files

Original bundle

Name: A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity.pdf
Size: 2.24 MB
Format: Adobe Portable Document Format
Description: Master's thesis

Collections

02. Master's Thesis