DSpace Repository

SURVEILLANCE VIDEO-TO-TEXT GENERATOR

Show simple item record

dc.contributor.author Zhailaubayeva, Kamilya
dc.date.accessioned 2024-05-19T14:18:59Z
dc.date.available 2024-05-19T14:18:59Z
dc.date.issued 2024-04-27
dc.identifier.citation Zhailaubayeva, Kamilya. (2024) Surveillance Video-to-Text Generator. Nazarbayev University School of Engineering and Digital Sciences. en_US
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/7686
dc.description.abstract Video-to-text generation is a relatively new field that has recently been gaining popularity alongside other generative models. It is particularly beneficial in surveillance, where textual descriptions can be retrieved from CCTV footage. This thesis therefore aims to design an efficient video-to-text generator for surveillance. The literature review established that encoder-decoder neural networks are the state-of-the-art technique; specifically, pretrained CNN models combined with LSTM-based encoder and decoder models are the most prominent. Although the research was originally intended for surveillance videos, the large MSR-VTT and MSVD datasets of general videos were used for training because no captioned surveillance dataset was available. The pipeline was designed in four parts: feature extraction, model training, test caption generation, and model evaluation. Video features, extracted with a pretrained VGG16 CNN, were fed into an encoder consisting of one LSTM layer; a decoder in the form of another LSTM layer then generated the video captions. In total, 12 models were trained across the two datasets with varying numbers of frames per video and vocabulary sizes. The best-performing model, trained on the MSVD dataset with 16 frames per video and a vocabulary size of 2000, scored 12.8, 32.2, 32.9, and 44.0 on the METEOR, BLEU-1, ROUGE, and CIDEr evaluation metrics respectively. The MSVD dataset is therefore the most suitable for the designed architecture. Furthermore, it was found that increasing the number of frames per video was not justified in terms of computational cost for short videos of 10 to 30 seconds, and a vocabulary size of 2000 proved well suited to the MSVD dataset. Although the proposed model generated captions, it underperformed past research on the evaluation metrics, which may be attributable to computational limitations, inaccurate caption datasets, and an improper choice of search algorithm. en_US
dc.language.iso en en_US
dc.publisher Nazarbayev University School of Engineering and Digital Sciences en_US
dc.rights Attribution-NonCommercial 3.0 United States
dc.rights.uri http://creativecommons.org/licenses/by-nc/3.0/us/
dc.subject Type of access: Restricted en_US
dc.subject Video-to-text generation en_US
dc.subject CNN en_US
dc.subject LSTM en_US
dc.subject NLP en_US
dc.title SURVEILLANCE VIDEO-TO-TEXT GENERATOR en_US
dc.type Master's thesis en_US
workflow.import.source science
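The abstract's pipeline (per-frame VGG16 features fed to a single-layer LSTM encoder, then a single-layer LSTM decoder emitting caption tokens greedily) can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: the weights are random and untrained, the hidden size of 512 and maximum caption length of 10 are assumptions, and only the 16 frames per video, 4096-dimensional VGG16 fc features, and 2000-word vocabulary come from the abstract.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: compute input/forget/output gates and candidate cell
    state from input x and previous hidden state h, then update (h, c)."""
    z = W @ x + U @ h + b
    n = h.shape[0]
    i = 1 / (1 + np.exp(-z[:n]))         # input gate
    f = 1 / (1 + np.exp(-z[n:2 * n]))    # forget gate
    o = 1 / (1 + np.exp(-z[2 * n:3 * n]))  # output gate
    g = np.tanh(z[3 * n:])               # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# Dimensions: 16 frames and 4096-dim VGG16 fc features per the abstract;
# HIDDEN = 512 is an assumed hidden size.
FRAMES, FEAT, HIDDEN, VOCAB = 16, 4096, 512, 2000
rng = np.random.default_rng(0)

# Stand-in for pretrained VGG16 output: one 4096-dim feature per frame.
features = rng.standard_normal((FRAMES, FEAT))

# Encoder LSTM: consume the frame-feature sequence into a final state.
We = rng.standard_normal((4 * HIDDEN, FEAT)) * 0.01
Ue = rng.standard_normal((4 * HIDDEN, HIDDEN)) * 0.01
be = np.zeros(4 * HIDDEN)
h = np.zeros(HIDDEN)
c = np.zeros(HIDDEN)
for x in features:
    h, c = lstm_step(x, h, c, We, Ue, be)

# Decoder LSTM: emit one token id per step by greedy argmax over the
# vocabulary, feeding the chosen token back in as a one-hot vector.
Wd = rng.standard_normal((4 * HIDDEN, VOCAB)) * 0.01
Ud = rng.standard_normal((4 * HIDDEN, HIDDEN)) * 0.01
bd = np.zeros(4 * HIDDEN)
Wout = rng.standard_normal((VOCAB, HIDDEN)) * 0.01

token = np.zeros(VOCAB)
token[0] = 1.0  # assumed <BOS> one-hot at index 0
caption = []
for _ in range(10):  # assumed maximum caption length
    h, c = lstm_step(token, h, c, Wd, Ud, bd)
    idx = int(np.argmax(Wout @ h))
    caption.append(idx)
    token = np.zeros(VOCAB)
    token[idx] = 1.0

print(len(caption))  # prints 10
```

A trained system would learn these weights by minimizing cross-entropy against reference captions, and the abstract's closing remark about "selection of the search algorithm" suggests beam search as an alternative to the greedy argmax used here.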

