π-VIT: HUMAN ACTIVITY RECOGNITION USING AUXILIARY TASKS-ENHANCED VIDEO TRANSFORMERS
Publisher: Nazarbayev University School of Engineering and Digital Sciences
Abstract
Human Activity Recognition (HAR) is a critical task in healthcare, enabling emergency detection and prevention without human supervision through IoT devices and machine learning techniques. While traditional unimodal approaches to HAR often fall short in recognizing complex or subtle activities, multimodal systems that integrate data from sensors such as accelerometers, gyroscopes, video, and audio provide richer context and higher accuracy. This work introduces the Pose- and Sensor-Induced Video Transformer (π-ViT), a framework that enhances HAR performance by inducing motion-sensor data through auxiliary learning tasks during training while retaining vision-only inference efficiency. Building on the principles of the Pose Induced Video Transformer (π-ViT), our methodology extends auxiliary-task learning to the gyroscope and accelerometer modalities by introducing induction modules. Experiments demonstrate that combining these modules with a video transformer backbone improves recognition of fine-grained human activities by up to 7%, particularly for subtle motions, advancing HAR systems toward practical healthcare deployment without requiring wearable sensors at inference time.
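The training scheme described in the abstract — auxiliary heads that regress motion-sensor signals from shared video features during training, then drop away at inference — can be sketched as follows. This is a minimal illustrative stand-in, not the thesis implementation: the dimensions, the linear "backbone", the single induction head, and the `aux_weight` coefficient are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the thesis).
FEAT_DIM, NUM_CLASSES, SENSOR_DIM = 16, 5, 6  # 6 = 3-axis accel + 3-axis gyro

# Stand-in for the video-transformer backbone: a fixed random projection.
W_backbone = rng.standard_normal((32, FEAT_DIM))
W_cls = rng.standard_normal((FEAT_DIM, NUM_CLASSES))
# Induction head: predicts motion-sensor signals from shared video features.
W_sensor = rng.standard_normal((FEAT_DIM, SENSOR_DIM))

def forward(frame_feats, labels=None, sensor=None, aux_weight=0.5):
    """Training uses video + sensor targets; inference is vision-only."""
    z = np.tanh(frame_feats @ W_backbone)  # shared video representation
    logits = z @ W_cls
    if labels is None:
        # Inference path: the induction head is simply never evaluated.
        return logits.argmax(axis=1)
    # Cross-entropy on activity labels.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(len(labels)), labels]).mean()
    # Auxiliary regression loss: induce the sensor modality during training.
    aux = np.mean((z @ W_sensor - sensor) ** 2)
    return ce + aux_weight * aux

x = rng.standard_normal((4, 32))                 # dummy clip features
y = rng.integers(0, NUM_CLASSES, 4)              # activity labels
s = rng.standard_normal((4, SENSOR_DIM))         # accel/gyro targets
loss = forward(x, y, s)   # training: joint classification + induction loss
preds = forward(x)        # inference: no wearable sensors needed
```

The point of the sketch is the asymmetry between the two calls: the sensor stream shapes the shared representation only through the training loss, so deployment requires cameras alone.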
Citation
Kirillov, K. (2025). π-ViT: Human Activity Recognition using Auxiliary Tasks-Enhanced Video Transformers. Nazarbayev University School of Engineering and Digital Sciences.
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution 3.0 United States
