VISION-LANGUAGE MODELS ON THE EDGE: AN ASSISTIVE TECHNOLOGY FOR THE VISUALLY IMPAIRED
Publisher
Nazarbayev University School of Engineering and Digital Sciences
Abstract
Vision-Language Models, or VLMs, are deep learning models at the intersection of Computer Vision and Natural Language Processing. They effectively combine image understanding and language generation capabilities and are widely used in various assistive tasks today. Nevertheless, the application of VLMs to assist visually impaired and blind people remains an underexplored area in the field. Existing approaches to developing assistive technology for the visually impaired have a substantial limitation: the computation is usually performed on the cloud, which makes the systems heavily dependent on an internet connection and the state of the remote server. This makes the systems unreliable, which limits their practical usage in everyday tasks. In our work, to address the issues of the previous approaches, we propose utilizing VLMs on embedded systems, ensuring real-time efficiency and autonomy of the assistive module. We present an end-to-end workflow for developing the system, extensively covering hardware and software architecture and integration with speech recognition and text-to-speech technologies. The developed system possesses comprehensive scene interpretation and user navigation capabilities necessary for visually impaired individuals to enhance their day-to-day activities. Moreover, we confirm the practical application of the wearable assistive module by conducting experiments with actual human participants and provide subjective as well as objective results from the system’s assessment.
Citation
Arystanbekov, B. (2024). Vision-Language Models on the Edge: An Assistive Technology for the Visually Impaired. Nazarbayev University School of Engineering and Digital Sciences.
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 United States
