Author: Zhiyenbayev, Adil
Dates: 2024-06-23; 2024-06-23; 2024-04-28
Citation: Zhiyenbayev, A. (2024). Enhancing Ambient Assisted Living: Multi-Modal Vision and Language Models for Real-Time Emergency Response. Nazarbayev University School of Engineering and Digital Sciences.
URI: http://nur.nu.edu.kz/handle/123456789/7973

Abstract:
The global population of older adults is forecast to surge to over 1.9 billion by 2050, escalating the demand for efficient healthcare delivery, particularly for the elderly and disabled, who frequently require caregiving due to prevalent mental and physical health conditions. This demographic trend underscores the critical need for robust long-term care services and continuous monitoring systems. However, the efficacy of such solutions is often compromised by caregiver overload, financial constraints, and logistical challenges in transportation, necessitating advanced technological interventions. In response, researchers have been refining ambient assisted living (AAL) environments through the integration of human activity recognition (HAR) built on machine learning (ML) and deep learning (DL) techniques. These methods aim to reduce emergency incidents and enhance early detection and intervention. Traditional sensor-based HAR systems, despite their utility, suffer from significant limitations, including high data variability, environmental interference, and a lack of contextual understanding.

To address these issues, vision-language models (VLMs) improve detection accuracy by interpreting scene context through caption generation, visual question answering (VQA), commonsense reasoning, and action recognition. However, VLMs face challenges in real-time applications because language ambiguity and occlusions can degrade detection accuracy. Large language models (LLMs) combined with text-to-speech (TTS) and speech-to-text (STT) technologies can facilitate direct communication with the individual and enable real-time interactive assessment of a situation. Integrating real-time conversational capabilities via LLM, TTS, and STT into the VLM framework significantly improves the detection of abnormal behavior by leveraging comprehensive scene understanding and direct patient feedback, thereby enhancing the system's reliability.

A qualitative evaluation based on a subjective questionnaire administered during real-time experiments with participants indicated high system usability. A quantitative evaluation of the developed system demonstrated strong performance across various emergency scenarios, achieving 93.44% accuracy, 95% recall, and 88.88% specificity before interaction. After the interaction stage, accuracy rose to 100% owing to the additional context provided by users' responses. Beyond identifying emergencies, the system also provides contextual summaries and actionable recommendations to caregivers and patients. This research introduces a multimodal framework that combines VLMs, LLMs, TTS, and STT for real-time abnormal behavior detection and assistance. The study aims to develop a comprehensive framework that overcomes the limitations of traditional HAR and AAL by integrating instruction-driven VLM, LLM, human detection, TTS, and STT modules to improve emergency response efficiency in home environments.
This innovative approach promises substantial advancements in the field of AAL by providing timely and context-aware detection and response in emergencies.

Language: en
License: Attribution-NonCommercial-NoDerivs 3.0 United States
Type of access: Restricted
Keywords: Ambient assisted living; Human activity recognition; Vision-language models; Large language models; Speech models; Prompt engineering
Title: ENHANCING AMBIENT ASSISTED LIVING WITH MULTI-MODAL VISION AND LANGUAGE MODELS: A NOVEL APPROACH FOR REAL-TIME ABNORMAL BEHAVIOR DETECTION AND EMERGENCY RESPONSE
Type: Master's thesis
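The abstract describes a staged pipeline (human detection, VLM scene assessment, and an LLM-driven spoken interaction via TTS/STT before the final decision) but does not specify concrete interfaces. The sketch below is a minimal, hypothetical illustration of that flow; all function names, signatures, and the Assessment structure are assumptions introduced for clarity, not the thesis implementation.

"""Minimal sketch of the multi-stage monitoring flow described in the abstract.

All names and signatures here are hypothetical placeholders. The flow is:
human detection -> VLM scene assessment -> (if abnormal) LLM-driven spoken
dialogue via TTS/STT -> final decision and caregiver summary.
"""

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Assessment:
    abnormal: bool            # does the scene look like an emergency?
    caption: str              # VLM description of the scene
    summary: str = ""         # contextual summary for the caregiver
    recommendation: str = ""  # suggested action


def monitor_frame(
    frame,
    detect_human: Callable[[object], bool],        # e.g. an object detector
    vlm_assess: Callable[[object], Assessment],    # VLM caption + VQA step
    ask_user: Callable[[str], Optional[str]],      # TTS question -> STT answer
    llm_decide: Callable[[str, str], Assessment],  # LLM over caption + reply
) -> Optional[Assessment]:
    """Run one monitoring cycle on a single camera frame."""
    if not detect_human(frame):
        return None  # nothing to assess without a person in view

    scene = vlm_assess(frame)
    if not scene.abnormal:
        return scene  # normal activity: no interaction needed

    # Interaction stage: ask the person directly and fold the reply into the
    # decision, which is what the abstract credits for the accuracy boost.
    reply = ask_user("Are you okay? Do you need any help?")
    if reply is None:
        reply = "(no response)"  # silence itself is informative context
    return llm_decide(scene.caption, reply)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any models.
    result = monitor_frame(
        frame="camera_frame_placeholder",
        detect_human=lambda f: True,
        vlm_assess=lambda f: Assessment(True, "An elderly person is lying on the kitchen floor."),
        ask_user=lambda prompt: None,  # simulate no verbal response
        llm_decide=lambda cap, rep: Assessment(
            True, cap,
            summary=f"Scene: {cap} User reply: {rep}",
            recommendation="Notify caregiver and emergency services.",
        ),
    )
    print(result)

In this sketch the model components are injected as callables, so the same control flow could wrap any particular VLM, LLM, TTS, or STT backend.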
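The pre-interaction figures reported in the abstract (93.44% accuracy, 95% recall, 88.88% specificity) appear to be the standard confusion-matrix rates; assuming the usual definitions, with TP, TN, FP, and FN denoting true/false positives and negatives over the evaluated emergency scenarios:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad
\mathrm{Recall} = \frac{TP}{TP + FN},\qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}
\]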