Abstract:
Skeleton-based action detection is gaining importance in human activity analysis because it provides a foundational understanding for applications ranging from surveillance to human-computer interaction. Motivated by the scarcity of labeled datasets in this area, our research emphasizes the vital role of self-supervised learning, and of contrastive learning strategies in particular. By pretraining models on unlabeled data, we demonstrate how self-supervised learning can significantly enhance a model's ability to identify human actions without requiring extensive annotated datasets. In contrastive learning, the encoder is essential: it converts raw input into representations that discriminate between similar (positive) and dissimilar (negative) action sequences. Our method incorporates Transformers as the encoder within the MoCo (Momentum Contrast) contrastive learning framework. Beyond departing from conventional supervised learning techniques, this combination leverages the strengths of Transformers to capture more effectively the intricate spatial-temporal correlations present in skeletal data. Integrating Transformer-based encoders into the MoCo framework advances skeleton-based action detection significantly and sets new benchmarks for efficiency and accuracy. Our results demonstrate the substantial advantages of combining sophisticated encoding methods with self-supervised learning, opening the door to more capable and less data-dependent analyses of human behavior.
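Although the abstract does not spell out the training objective, the standard MoCo framework optimizes the InfoNCE contrastive loss; as a sketch, for an encoded query $q$, its positive key $k^{+}$ (produced by the momentum encoder), a queue of $K$ negative keys $\{k_i\}$, and temperature $\tau$:

```latex
\mathcal{L}_q = -\log
  \frac{\exp\!\left(q \cdot k^{+} / \tau\right)}
       {\exp\!\left(q \cdot k^{+} / \tau\right)
        + \sum_{i=1}^{K} \exp\!\left(q \cdot k_i / \tau\right)}
```

Minimizing this loss pulls the query toward its positive key (an augmented view of the same action sequence) and pushes it away from the negatives drawn from other sequences; the specific augmentations and queue size are design choices not fixed by the abstract.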