DSpace Repository

ENHANCING EMERGENCY RESPONSE: THE ROLE OF INTEGRATED VISION-LANGUAGE MODELS IN IN-HOME HEALTHCARE AND EFFICIENT MULTIMEDIA RETRIEVAL

dc.contributor.author Abdrakhmanov, Rakhat
dc.date.accessioned 2024-07-09T05:40:02Z
dc.date.available 2024-07-09T05:40:02Z
dc.date.issued 2024-06
dc.identifier.citation Abdrakhmanov, R. (2024). Enhancing Emergency Response: The Role of Integrated Vision-Language Models in In-Home Healthcare and Efficient Multimedia Retrieval. Nazarbayev University School of Engineering and Digital Sciences en_US
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/8096
dc.description.abstract Incidents of in-home injuries and sudden critical health conditions are relatively common and demand swift medical expertise. This study introduces an innovative use of vision-language models (VLMs) to improve healthcare through better emergency recognition and efficient multimedia search. By harnessing the combined strengths of large language models (LLMs) and vision transformers (ViTs), the study enhances the analysis of both visual and textual information. We propose a framework that uses the PrismerZ VLM, in both its Base and Large forms, together with a key frame selection (KFS) algorithm to pinpoint and examine pertinent images within video streams. This enables the creation of enriched datasets in which images are paired with descriptive narratives and insights gained from visual question answering (VQA). By integrating the CLIP-ViT-L-14 model with the MongoDB Atlas cloud database, we developed a multimodal retrieval system that handles complex queries and improves the user experience. Additionally, this research undertakes data collection to assess the system's adaptability, providing a proof of concept and refining the framework. The results demonstrate the system's robustness, with high accuracy rates (86.5% in image captioning and 92.5% in VQA tasks) on the Kinetics dataset. When tested with human subject data, the PrismerZ Large model achieved 85.8% accuracy in image captioning and 87.5% in VQA tasks. This performance was further enhanced through fine-tuning with the GPT-4-based ChatGPT, one of the largest language assistants, yielding a 20% improvement in semantic text similarity as measured by the BERT model. The PrismerZ models also stand out for their speed: the Base and Large versions complete image captioning and VQA tasks in seconds, even on the NVIDIA Jetson Orin NX edge device. These findings confirm the system's reliability in real-life scenarios. The multimodal retrieval system achieved top performance, with a mean average precision at k (MAP@k) of 93% and a mean reciprocal rank (MRR) of 94.79% on the Kinetics dataset, while maintaining an average search latency of only 0.33 seconds for text queries. This research advances the fields of human activity recognition (HAR) and emergency detection, opening new paths for anomaly detection and enriched multimedia understanding. By integrating the VLM with multimedia information retrieval, we aim to set new benchmarks for human care, improving its timeliness, comprehensiveness, and efficiency in accessing multimedia data. en_US
dc.language.iso en en_US
dc.publisher Nazarbayev University School of Engineering and Digital Sciences en_US
dc.rights Attribution-NonCommercial-ShareAlike 3.0 United States
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject Type of access: Restricted en_US
dc.title ENHANCING EMERGENCY RESPONSE: THE ROLE OF INTEGRATED VISION-LANGUAGE MODELS IN IN-HOME HEALTHCARE AND EFFICIENT MULTIMEDIA RETRIEVAL en_US
dc.type Master's thesis en_US
workflow.import.source science
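
The abstract above pairs the CLIP-ViT-L-14 model with a MongoDB Atlas cloud database to serve text queries over key frames. The following is a minimal sketch of that text-query path, assuming an Atlas Vector Search index named clip_index and a keyframes collection whose documents store each frame's CLIP image vector in an embedding field; these names, the connection string, and the projected fields are illustrative assumptions, not the thesis's actual schema.

    import torch
    from pymongo import MongoClient
    from transformers import CLIPModel, CLIPProcessor

    # CLIP-ViT-L-14, the embedding model named in the abstract.
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    def embed_text(query: str) -> list[float]:
        # Project the text query into CLIP's joint image-text embedding space.
        inputs = processor(text=[query], return_tensors="pt", padding=True)
        with torch.no_grad():
            features = model.get_text_features(**inputs)
        # L2-normalise so similarity search behaves like cosine similarity.
        features = features / features.norm(dim=-1, keepdim=True)
        return features[0].tolist()

    # Placeholder connection string; $vectorSearch requires a MongoDB Atlas cluster.
    client = MongoClient("mongodb+srv://user:pass@cluster.example.mongodb.net")
    frames = client["har"]["keyframes"]  # hypothetical database/collection names

    def search(query: str, k: int = 5) -> list[dict]:
        # Approximate nearest-neighbour search over the stored CLIP image vectors.
        pipeline = [
            {"$vectorSearch": {
                "index": "clip_index",      # assumed Atlas Vector Search index name
                "path": "embedding",        # field holding each frame's CLIP vector
                "queryVector": embed_text(query),
                "numCandidates": 100,
                "limit": k,
            }},
            {"$project": {"_id": 0, "video_id": 1, "caption": 1,
                          "score": {"$meta": "vectorSearchScore"}}},
        ]
        return list(frames.aggregate(pipeline))

In a layout like this, each returned document could carry the caption and VQA answers produced by PrismerZ for that frame, so a single text query surfaces both the frame and its narrative context.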
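Retrieval quality is reported above as MAP@k and MRR. Below is a plain-Python sketch of those two metrics, assuming each query comes with a ranked result list and a set of relevant item ids; this is our formulation for illustration, while the exact evaluation protocol is described in the thesis itself.

    def average_precision_at_k(ranked: list, relevant: set, k: int) -> float:
        # AP@k: average of the precision values at each rank where a hit occurs.
        hits, total = 0, 0.0
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                hits += 1
                total += hits / rank
        return total / min(len(relevant), k) if relevant else 0.0

    def mean_average_precision_at_k(runs: list[tuple[list, set]], k: int) -> float:
        # MAP@k: mean of AP@k over all queries.
        return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

    def mean_reciprocal_rank(runs: list[tuple[list, set]]) -> float:
        # MRR: mean of 1/rank of the first relevant result for each query.
        total = 0.0
        for ranked, relevant in runs:
            for rank, item in enumerate(ranked, start=1):
                if item in relevant:
                    total += 1.0 / rank
                    break
        return total / len(runs)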


