A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity

dc.contributor.advisor: Sandygulova, Anara
dc.contributor.author: Orazumbekov, Batyrbek
dc.date.accessioned: 2026-05-29T05:28:52Z
dc.date.issued: 2026-05-04
dc.description.abstract: Most sign language recognition systems are built as classification models that map gesture videos to predefined glosses. Such systems do not support similarity search, in which users query without knowing a gesture's label. This thesis proposes a pose-based sign language retrieval system that functions as a reverse gesture dictionary, enabling users to retrieve visually similar gestures directly from video input. The proposed method represents gestures as normalized skeletal joints rather than RGB images, which minimizes variation in appearance (background, lighting, clothing) and focuses on dynamic motion patterns. The extracted keypoints are temporally normalized and optionally augmented with motion features to better capture gesture dynamics. To model temporal relationships in the data, two architectures are considered: a Transformer with a self-attention mechanism and a Spatial-Temporal Graph Convolutional Network (ST-GCN). The two are used to compare how well such models capture temporal dependencies. The models were evaluated on the WLASL dataset in a few-shot setting using ranking metrics, Recall@K and mean Average Precision (mAP), rather than classification accuracy, because ranking metrics better suit a retrieval task. The experimental results indicate that the Transformer models temporal relationships between frames in gesture sequences better than the graph-based model. Additionally, attention-driven pooling during temporal aggregation improves the results significantly and achieves an mAP of 0.237 on the validation set. Transferability of the embedding space to novel gestures is tested by applying the trained model to the AUTSL dataset (using only a subset of 226 labels). Finally, the impact of approximate nearest neighbor search on retrieval results is examined.
dc.identifier.citation: Orazumbekov, Batyrbek. (2026). A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity. Nazarbayev University School of Engineering and Digital Sciences
dc.identifier.uri: https://nur.nu.edu.kz/handle/123456789/18782
dc.language.iso: en
dc.publisher: Nazarbayev University School of Engineering and Digital Sciences
dc.rights: Attribution 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/
dc.subject: sign language recognition
dc.subject: gesture retrieval
dc.subject: pose estimation
dc.subject: Transformer
dc.subject: ST-GCN
dc.subject: few-shot learning
dc.subject: embedding space
dc.subject: approximate nearest neighbor search
dc.subject: WLASL
dc.subject: PQDT_Master
dc.title: A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity
dc.type: Master's thesis
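
The abstract names Recall@K and mean Average Precision (mAP) as the ranking metrics used for evaluation. The following is a minimal sketch of how such metrics are commonly computed over an embedding space, not the thesis's own implementation. It assumes cosine similarity between query and gallery embeddings and exact (brute-force) ranking; the function names `rank_gallery`, `recall_at_k`, and `mean_average_precision` are hypothetical and chosen for illustration only.

```python
import numpy as np


def rank_gallery(queries, gallery):
    """Return gallery indices per query, sorted by descending cosine similarity.

    Assumption: cosine similarity is the distance measure; the thesis may use another.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T  # shape (num_queries, num_gallery)
    return np.argsort(-sims, axis=1, kind="stable")


def recall_at_k(ranked, query_labels, gallery_labels, k):
    """Fraction of queries with at least one same-label gallery item in the top k."""
    top_labels = gallery_labels[ranked[:, :k]]
    hits = (top_labels == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


def mean_average_precision(ranked, query_labels, gallery_labels):
    """Mean over queries of the average precision of the full ranked list."""
    relevant = gallery_labels[ranked] == query_labels[:, None]
    average_precisions = []
    for row in relevant:
        num_relevant = int(row.sum())
        if num_relevant == 0:
            continue  # queries with no relevant gallery item are skipped
        positions = np.flatnonzero(row) + 1  # 1-based ranks of relevant items
        precisions = np.arange(1, num_relevant + 1) / positions
        average_precisions.append(precisions.mean())
    return float(np.mean(average_precisions)) if average_precisions else 0.0
```

Swapping the exact ranking in `rank_gallery` for an approximate nearest neighbor index while keeping the metric functions unchanged is one way to measure the effect of approximate search on retrieval quality.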

Files

Original bundle

Name: A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity.pdf
Size: 2.24 MB
Format: Adobe Portable Document Format
Description: Master's thesis