dc.contributor.author | Zhailaubayeva, Kamilya | |
dc.date.accessioned | 2024-05-19T14:18:59Z | |
dc.date.available | 2024-05-19T14:18:59Z | |
dc.date.issued | 2024-04-27 | |
dc.identifier.citation | Zhailaubayeva, Kamilya. (2024) Surveillance Video-to-Text Generator. Nazarbayev University School of Engineering and Digital Sciences. | en_US |
dc.identifier.uri | http://nur.nu.edu.kz/handle/123456789/7686 | |
dc.description.abstract | Video-to-text generation is a relatively new field that, like other generative modeling tasks, has recently been gaining popularity. It is particularly beneficial for surveillance, where textual descriptions can be retrieved from CCTV footage. This thesis therefore aims to design an efficient video-to-text generator for surveillance. The literature review established that encoder-decoder neural networks are the current state-of-the-art technique; in particular, pretrained CNNs combined with LSTM-based encoder and decoder models are the most prominent architectures. Although the research was originally intended for surveillance videos, the large general-purpose MSR-VTT and MSVD datasets were used for training because no captioned surveillance dataset was available. The pipeline was designed in four stages: feature extraction, model training, test caption generation, and model evaluation. Video features, extracted with a pretrained VGG16 CNN, were fed into an encoder consisting of one LSTM layer; a second LSTM layer served as the decoder that generated the video captions. In total, 12 models were trained on the two datasets with varying numbers of frames per video and vocabulary sizes. The best-performing model, trained on the MSVD dataset with 16 frames per video and a vocabulary size of 2000, scored 12.8, 32.2, 32.9, and 44.0 on the METEOR, BLEU1, ROUGE, and CIDEr evaluation metrics respectively. The MSVD dataset is therefore the most suitable for the designed architecture. Furthermore, it was found that increasing the number of frames per video was not justified by the additional computational cost for short videos of 10 to 30 seconds. Finally, a vocabulary size of 2000 proved well suited to the MSVD dataset. Although the proposed model generated captions, it underperformed prior work on the evaluation metrics. This may be attributable to computational limitations, inaccuracies in the caption datasets, and a suboptimal choice of search algorithm. | en_US |
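The abstract above describes a feature-extraction / LSTM-encoder / LSTM-decoder pipeline. The following is a shape-level sketch of that data flow only, not the thesis code: the stand-in functions, the hidden size of 512, and the toy greedy decoder are all assumptions for illustration (the thesis uses a pretrained VGG16 and trained Keras-style LSTM layers; here random weights merely make the tensor shapes concrete).

```python
import numpy as np

# Dimensions taken from the abstract's best model; HIDDEN is an assumption.
NUM_FRAMES = 16   # frames sampled per video
FEAT_DIM = 4096   # VGG16 fully-connected feature size
HIDDEN = 512      # assumed LSTM hidden size (not stated in the abstract)
VOCAB = 2000      # vocabulary size of the best model

def extract_features(video_frames):
    """Stand-in for VGG16: map each frame to a 4096-d feature vector."""
    return np.random.randn(len(video_frames), FEAT_DIM)

def encode(features):
    """Stand-in for the LSTM encoder: summarize the frame sequence
    into a fixed-size state (here, mean of a random projection)."""
    W = np.random.randn(FEAT_DIM, HIDDEN) * 0.01
    return np.tanh(features @ W).mean(axis=0)

def decode(state, max_len=10):
    """Stand-in greedy decoder: emit token ids until <eos> (id 0)."""
    W = np.random.randn(HIDDEN, VOCAB) * 0.01
    tokens = []
    for _ in range(max_len):
        logits = state @ W          # project state onto the vocabulary
        tok = int(np.argmax(logits))
        if tok == 0:                # reserved <eos> token
            break
        tokens.append(tok)
        state = np.tanh(state + 0.1)  # toy recurrent state update
    return tokens

frames = [None] * NUM_FRAMES        # placeholder frame list
feats = extract_features(frames)    # (16, 4096)
state = encode(feats)               # (512,)
caption_ids = decode(state)         # token ids in [1, 2000)
```

In the actual system the random projections above are replaced by trained LSTM weights, and the greedy loop by the search algorithm whose choice the abstract flags as a possible weakness.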
dc.language.iso | en | en_US |
dc.publisher | Nazarbayev University School of Engineering and Digital Sciences | en_US |
dc.rights | Attribution-NonCommercial 3.0 United States | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/3.0/us/ | * |
dc.subject | Type of access: Restricted | en_US |
dc.subject | Video-to-text generation | en_US |
dc.subject | CNN | en_US |
dc.subject | LSTM | en_US |
dc.subject | NLP | en_US |
dc.title | SURVEILLANCE VIDEO-TO-TEXT GENERATOR | en_US |
dc.type | Master's thesis | en_US |
workflow.import.source | science |