A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity

Publisher

Nazarbayev University School of Engineering and Digital Sciences

Abstract

Most sign language recognition systems are built as classification models that map gesture videos to predefined glosses. Such systems do not support similarity search, in which users query with a gesture whose label they do not know. This thesis proposes a pose-based sign language retrieval system that functions as a reverse gesture dictionary, enabling users to retrieve visually similar gestures directly from video input. The proposed method represents gestures as normalized skeletal joints rather than RGB images, which minimizes appearance variation such as background, lighting, and clothing and keeps the focus on dynamic motion patterns. The extracted keypoints are temporally normalized and optionally augmented with motion features to better capture gesture dynamics. Two architectures are considered for modeling temporal relationships in the data: a Transformer with a self-attention mechanism and a Spatial-Temporal Graph Convolutional Network (ST-GCN). Comparing them shows how well each type of sequence model captures temporal dependencies. Evaluation uses the WLASL dataset in a few-shot setting with ranking metrics such as Recall@K and mean Average Precision (mAP) rather than classification accuracy, because ranking metrics better suit a retrieval task. The experimental results indicate that the Transformer models temporal relationships between frames in gesture sequences better than graph-based models. In addition, attention-driven pooling during temporal aggregation improves the results significantly, achieving an mAP of 0.237 on the validation set. Transferability of the embedding space to novel gestures is tested by applying the trained model to the AUTSL dataset (using only a subset of 226 labels). Finally, the impact of approximate nearest neighbor search on retrieval results is examined.
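
For readers unfamiliar with the ranking metrics named in the abstract, the following is a minimal sketch of how Recall@K and mAP can be computed for an embedding-based retrieval task. It is not the thesis implementation. The function names, the cosine similarity measure, the toy two-dimensional embeddings, and the gloss labels are all assumptions made for illustration, and the thesis may define the metrics differently (for example, in how the gallery is built in the few-shot setting).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending similarity to the query."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: -scores[i])

def recall_at_k(ranked_labels, query_label, k):
    """1.0 if any of the top-k retrieved items shares the query label, else 0.0."""
    return 1.0 if query_label in ranked_labels[:k] else 0.0

def average_precision(ranked_labels, query_label):
    """Mean of precision values taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def evaluate(queries, query_labels, gallery, gallery_labels, k=1):
    """Average Recall@k and average precision (mAP) over all queries."""
    recalls, aps = [], []
    for q, q_label in zip(queries, query_labels):
        ranked = [gallery_labels[i] for i in rank_gallery(q, gallery)]
        recalls.append(recall_at_k(ranked, q_label, k))
        aps.append(average_precision(ranked, q_label))
    return sum(recalls) / len(queries), sum(aps) / len(queries)

# Toy embeddings (hypothetical): two gallery clips per gloss.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["hello", "hello", "thanks", "thanks"]
queries = [[1.0, 0.05], [0.6, 0.5]]
query_labels = ["hello", "thanks"]

recall_1, mean_ap = evaluate(queries, query_labels, gallery, gallery_labels, k=1)
print(f"Recall@1 = {recall_1:.3f}, mAP = {mean_ap:.3f}")
```

In this sketch the second query is deliberately ambiguous, so its correct matches are ranked below the incorrect ones. This is the situation where mAP is more informative than a single top-1 decision, since it credits correct items according to how high they appear in the ranked list.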

Citation

Orazumbekov, Batyrbek. (2026). A Video-Based Reverse Dictionary for Sign Language Using Gesture Similarity. Nazarbayev University School of Engineering and Digital Sciences

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution 3.0 United States