Title: Enhancing Emergency Response: The Role of Integrated Vision-Language Models in In-Home Healthcare and Efficient Multimedia Retrieval
Author: Abdrakhmanov, Rakhat
Type: Master's thesis
Date issued: 2024-06
Date accessioned: 2024-07-09
Date available: 2024-07-09
Citation: Abdrakhmanov, R. (2024). Enhancing Emergency Response: The Role of Integrated Vision-Language Models in In-Home Healthcare and Efficient Multimedia Retrieval. Nazarbayev University School of Engineering and Digital Sciences.
URI: http://nur.nu.edu.kz/handle/123456789/8096
Language: en
Rights: Attribution-NonCommercial-ShareAlike 3.0 United States
Type of access: Restricted

Abstract:
Incidents of in-home injuries and sudden critical health conditions are relatively common and necessitate swift medical expertise. This study introduces an innovative use of vision-language models (VLMs) to elevate in-home healthcare through improved emergency recognition and efficient multimedia search capabilities. By harnessing the combined strengths of large language models (LLMs) and vision transformers (ViTs), it enhances the analysis of both visual and textual information. We propose a framework that uses the PrismerZ VLM, in both its Base and Large forms, together with a key frame selection (KFS) algorithm to pinpoint and examine pertinent images within video streams. This enables the creation of enriched datasets in which images are paired with descriptive captions and insights gained from visual question answering (VQA). By integrating the CLIP-ViT-L-14 model with the MongoDB Atlas cloud database, we developed a multimodal retrieval system that handles complex queries and improves the user experience. Additionally, this research undertakes data collection to assess the system's adaptability, providing a proof of concept and refining the framework.

The results demonstrate the system's robustness, with accuracy rates of 86.5% in image captioning and 92.5% in VQA tasks on the Kinetics dataset. When tested on human subject data, the PrismerZ Large model achieved 85.8% accuracy in image captioning and 87.5% in VQA tasks. This performance was further enhanced through fine-tuning with the GPT-4-based ChatGPT, yielding a 20% improvement in semantic text similarity as measured by the BERT model. The PrismerZ models also stand out for their speed: both the Base and Large versions complete image captioning and VQA tasks in seconds, even on the NVIDIA Jetson Orin NX edge device. These findings confirm the system's reliability in real-life scenarios. The multimodal retrieval system achieved top performance with a mean average precision at k (MAP@k) of 93% and a mean reciprocal rank (MRR) of 94.79% on the Kinetics dataset, while maintaining an average search latency of just 0.33 seconds for text queries. This research advances the fields of human activity recognition (HAR) and emergency detection, opening new paths for anomaly detection and enriched multimedia understanding. By integrating VLMs with multimedia information retrieval, our objective is to establish new benchmarks for human care, improving its timeliness, comprehensiveness, and efficiency in accessing multimedia data.
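The abstract does not detail how the KFS algorithm selects frames, so the following is only a minimal illustrative sketch that assumes a simple inter-frame difference heuristic; the function name, threshold, and selection logic are hypothetical stand-ins for the thesis's actual method.

# Hypothetical key-frame selection sketch (assumed heuristic, not the
# thesis's KFS algorithm): keep a frame when its mean absolute grayscale
# difference from the last kept frame exceeds a threshold.
import cv2
import numpy as np

def select_key_frames(video_path: str, diff_threshold: float = 30.0) -> list:
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Keep the first frame, and any frame that differs enough from
        # the previously kept one.
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            key_frames.append(frame)
            prev_gray = gray
    cap.release()
    return key_frames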
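As a rough illustration of the CLIP-ViT-L-14 plus MongoDB Atlas retrieval pipeline described above, the sketch below embeds a text query with a CLIP encoder and runs an Atlas Vector Search aggregation. The connection URI, database, collection, index, and field names are assumptions for illustration, not the thesis's actual configuration.

# Illustrative multimodal retrieval sketch: CLIP text embedding plus
# MongoDB Atlas $vectorSearch. All names below are assumed placeholders.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-L-14")        # CLIP text/image encoder
client = MongoClient("mongodb+srv://<cluster-uri>")  # placeholder Atlas URI
frames = client["healthcare"]["key_frames"]          # hypothetical collection

def search(query: str, k: int = 5) -> list:
    # Embed the text query into the same space as the stored frame vectors.
    query_vec = model.encode(query).tolist()
    # Approximate nearest-neighbour search over stored CLIP embeddings.
    return list(frames.aggregate([
        {"$vectorSearch": {
            "index": "frame_embeddings",  # assumed vector search index name
            "path": "embedding",          # field holding stored CLIP vectors
            "queryVector": query_vec,
            "numCandidates": 100,
            "limit": k,
        }},
        {"$project": {"caption": 1, "vqa_answers": 1, "_id": 0}},
    ]))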
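For reference, the reported retrieval metrics follow their standard definitions; one common formulation, over a query set Q, is:

\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\mathrm{rank}_q},
\qquad
\mathrm{MAP@}k = \frac{1}{|Q|} \sum_{q=1}^{|Q|}
  \frac{1}{\min(k, R_q)} \sum_{i=1}^{k} P_q(i)\,\mathrm{rel}_q(i),
\]

where \(\mathrm{rank}_q\) is the position of the first relevant result for query \(q\), \(R_q\) is the number of relevant items for \(q\), \(P_q(i)\) is the precision at cut-off \(i\), and \(\mathrm{rel}_q(i) \in \{0,1\}\) indicates whether the item at rank \(i\) is relevant.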