ENHANCING EMERGENCY RESPONSE: THE ROLE OF INTEGRATED VISION-LANGUAGE MODELS IN IN-HOME HEALTHCARE AND EFFICIENT MULTIMEDIA RETRIEVAL
Date
2024-06
Authors
Abdrakhmanov, Rakhat
Publisher
Nazarbayev University School of Engineering and Digital Sciences
Abstract
Incidents of in-home injuries and sudden critical health conditions are relatively common and demand swift medical attention. This study introduces an innovative use of vision-language models (VLMs) to improve human healthcare through better emergency recognition and efficient multimedia search. By harnessing the combined strengths of large language models (LLMs) and vision transformers (ViTs), the study enhances the analysis of both visual and textual information. We propose a framework that uses the PrismerZ VLM, in both its Base and Large variants, together with a key frame selection (KFS) algorithm to pinpoint and examine pertinent images within video streams. This enables the creation of enriched datasets in which images are paired with descriptive captions and insights gained from visual question answering (VQA).
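The abstract does not detail the KFS algorithm itself, so the following is only a minimal sketch assuming a simple frame-difference heuristic; the function name and the diff_threshold parameter are illustrative, not the thesis's actual method.

    # Minimal key-frame-selection sketch (assumed frame-difference heuristic;
    # the actual KFS algorithm is not specified in this abstract).
    import cv2
    import numpy as np

    def select_key_frames(video_path, diff_threshold=30.0):
        """Yield frames whose mean absolute difference from the previously
        selected frame exceeds diff_threshold."""
        cap = cv2.VideoCapture(video_path)
        last_gray = None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if last_gray is None or np.mean(cv2.absdiff(gray, last_gray)) > diff_threshold:
                last_gray = gray
                yield frame
        cap.release()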
Through the integration of the CLIP-ViT-L-14 model and the MongoDB Atlas cloud database, we developed a multimodal retrieval system that supports complex queries and an improved user experience. Additionally, this research includes data collection with human subjects to assess the system's adaptability, providing a proof of concept and refining the framework.
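As a rough illustration of how such a retrieval pipeline can be wired together, the sketch below embeds a text query with CLIP-ViT-L-14 and runs an Atlas Vector Search over stored frame embeddings; the connection string, collection layout, index name, and field names are assumptions, not the thesis's actual schema.

    # Text-to-image retrieval sketch: CLIP-ViT-L-14 query embedding plus
    # MongoDB Atlas Vector Search. Index/collection/field names are assumed.
    import torch
    from pymongo import MongoClient
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
    frames = MongoClient("mongodb+srv://<cluster-uri>")["har"]["frames"]  # hypothetical

    def search(text_query, k=10):
        inputs = processor(text=[text_query], return_tensors="pt", padding=True)
        with torch.no_grad():
            vec = model.get_text_features(**inputs)[0]
        vec = (vec / vec.norm()).tolist()  # unit-normalize for cosine similarity
        return list(frames.aggregate([{
            "$vectorSearch": {
                "index": "frame_embeddings",  # assumed index name
                "path": "clip_embedding",     # assumed vector field
                "queryVector": vec,
                "numCandidates": 10 * k,
                "limit": k,
            }
        }]))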
The results demonstrate the system's robustness, with accuracy of 86.5% on image captioning and 92.5% on VQA tasks on the Kinetics dataset. When tested on human-subject data, the PrismerZ Large model achieved 85.8% accuracy on image captioning and 87.5% on VQA tasks. Fine-tuning with the GPT-4-based ChatGPT yielded a further 20% improvement in semantic text similarity as measured by a BERT model.
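The exact BERT-based scoring setup is not stated in the abstract; one common realization is cosine similarity over BERT-style sentence embeddings, for example (model choice assumed):

    # Cosine similarity over BERT-style sentence embeddings; the specific
    # model ("all-MiniLM-L6-v2") is an assumed stand-in, not the thesis's.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_similarity(candidate, reference):
        emb = encoder.encode([candidate, reference], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    # e.g. comparing a generated caption with a reference description:
    print(semantic_similarity("A person has fallen on the kitchen floor.",
                              "An elderly man lies on the floor after a fall."))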
The PrismerZ models also stand out for their speed: both the Base and Large versions process image captioning and VQA tasks in seconds, even on the NVIDIA Jetson Orin NX edge device. These findings confirm the system's reliability in real-life scenarios. The multimodal retrieval system achieved top performance, with a mean average precision at k (MAP@k) of 93% and a mean reciprocal rank (MRR) of 94.79% on the Kinetics dataset, while maintaining an average search latency of just 0.33 seconds for text queries.
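For reference, the reported retrieval metrics follow the standard definitions over a query set Q (the thesis may differ in minor normalization details):

    \mathrm{MAP@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\min(k,\, R_q)} \sum_{i=1}^{k} P_q(i)\,\mathrm{rel}_q(i),
    \qquad
    \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q},

where P_q(i) is precision at cut-off i, rel_q(i) indicates whether the i-th result is relevant, R_q is the number of relevant items for query q, and rank_q is the rank of the first relevant result for query q.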
This research significantly advances human activity recognition (HAR) and emergency detection, carving out new paths for anomaly detection and enriched multimedia understanding. By integrating VLMs with multimedia information retrieval, we aim to establish new benchmarks for human care, improving its timeliness, comprehensiveness, and efficiency in accessing multimedia data.
Type of access: Restricted
Citation
Abdrakhmanov, R. (2024). Enhancing Emergency Response: The Role of Integrated Vision-Language Models in In-Home Healthcare and Efficient Multimedia Retrieval. Nazarbayev University School of Engineering and Digital Sciences