dc.contributor.author | Zhailaubayeva, Kamilya | |
dc.date.accessioned | 2024-05-19T14:18:59Z | |
dc.date.available | 2024-05-19T14:18:59Z | |
dc.date.issued | 2024-04-27 | |
dc.identifier.citation | Zhailaubayeva, Kamilya. (2024) Surveillance Video-to-Text Generator. Nazarbayev University School of Engineering and Digital Sciences. | en_US |
dc.identifier.uri | http://nur.nu.edu.kz/handle/123456789/7686 | |
dc.description.abstract | Video-to-text generation is a relatively new field that, like other generative modeling tasks, has recently been gaining popularity. It is particularly beneficial for surveillance, where textual descriptions can be retrieved from CCTV footage. This thesis therefore aims to design an efficient video-to-text generator for surveillance. The literature review established that encoder-decoder neural networks are the current state-of-the-art technique; in particular, pretrained CNNs combined with LSTM-based encoder and decoder models are the most prominent architectures. Although the research was originally intended for surveillance videos, the large general-purpose MSR-VTT and MSVD datasets were used for training because no captioned surveillance dataset was available. The pipeline was designed in four stages: feature extraction, model training, test caption generation, and model evaluation. Video features, extracted with a pretrained VGG16 CNN, were fed into an encoder consisting of one LSTM layer; a second LSTM layer served as the decoder that generated the video captions. In total, 12 models were trained on the two datasets with varying numbers of frames per video and vocabulary sizes. The best-performing model, trained on the MSVD dataset with 16 frames per video and a vocabulary size of 2000, scored 12.8, 32.2, 32.9, and 44.0 on the METEOR, BLEU1, ROUGE, and CIDEr evaluation metrics respectively. The MSVD dataset is therefore the most suitable for the designed architecture. Furthermore, it was found that increasing the number of frames per video was not justified by the additional computational cost for short videos of 10 to 30 seconds. Finally, a vocabulary size of 2000 proved well suited to the MSVD dataset. Although the proposed model generated captions, it underperformed prior work on the evaluation metrics. This may be attributable to computational limitations, inaccuracies in the caption datasets, and a suboptimal choice of search algorithm. | en_US |
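The abstract above describes a feature-extraction / LSTM-encoder / LSTM-decoder pipeline. The following is a shape-level sketch of that data flow only, not the thesis code: the stand-in functions, the hidden size of 512, and the toy greedy decoder are all assumptions for illustration (the thesis uses a pretrained VGG16 and trained Keras-style LSTM layers; here random weights merely make the tensor shapes concrete).

```python
import numpy as np

# Dimensions taken from the abstract's best model; HIDDEN is an assumption.
NUM_FRAMES = 16   # frames sampled per video
FEAT_DIM = 4096   # VGG16 fully-connected feature size
HIDDEN = 512      # assumed LSTM hidden size (not stated in the abstract)
VOCAB = 2000      # vocabulary size of the best model

def extract_features(video_frames):
    """Stand-in for VGG16: map each frame to a 4096-d feature vector."""
    return np.random.randn(len(video_frames), FEAT_DIM)

def encode(features):
    """Stand-in for the LSTM encoder: summarize the frame sequence
    into a fixed-size state (here, mean of a random projection)."""
    W = np.random.randn(FEAT_DIM, HIDDEN) * 0.01
    return np.tanh(features @ W).mean(axis=0)

def decode(state, max_len=10):
    """Stand-in greedy decoder: emit token ids until <eos> (id 0)."""
    W = np.random.randn(HIDDEN, VOCAB) * 0.01
    tokens = []
    for _ in range(max_len):
        logits = state @ W          # project state onto the vocabulary
        tok = int(np.argmax(logits))
        if tok == 0:                # reserved <eos> token
            break
        tokens.append(tok)
        state = np.tanh(state + 0.1)  # toy recurrent state update
    return tokens

frames = [None] * NUM_FRAMES        # placeholder frame list
feats = extract_features(frames)    # (16, 4096)
state = encode(feats)               # (512,)
caption_ids = decode(state)         # token ids in [1, 2000)
```

In the actual system the random projections above are replaced by trained LSTM weights, and the greedy loop by the search algorithm whose choice the abstract flags as a possible weakness.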
dc.language.iso | en | en_US |
dc.publisher | Nazarbayev University School of Engineering and Digital Sciences | en_US |
dc.rights | Attribution-NonCommercial 3.0 United States | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/3.0/us/ | * |
dc.subject | Type of access: Restricted | en_US |
dc.subject | Video-to-text generation | en_US |
dc.subject | CNN | en_US |
dc.subject | LSTM | en_US |
dc.subject | NLP | en_US |
dc.title | SURVEILLANCE VIDEO-TO-TEXT GENERATOR | en_US |
dc.type | Master's thesis | en_US |
workflow.import.source | science |