Master’s Thesis – Pose2Act: Transformer-based 3D Pose Estimation and Graph Convolution Networks for Human Activity Recognition
Dias Aimyshev, NU SEDS, 2nd-year CS MS student
Thesis Supervisor: Adnan Yazici

Outline
- Introduction
- Main part: Related Works, Methodology, Results, Demo Application
- Conclusion and Future Work

Introduction
- Pose Estimation – predict the coordinates of human body joints
- Human Activity Recognition (HAR) – classify human activity

Brief History
- Deep Learning: CNN, RNN, LSTM, Autoencoders, TCN, GCN, Transformers, 3D CNN

Problem Statement
- Pose Estimation and HAR are often viewed separately
- Skeleton-based HAR takes the output of 3D Pose Estimation as its input
- Most SOTA HAR models use data generated by motion-capture systems

Motivation
- Motion capture means limited application and hardware dependence
- Low performance on 2D data
- Skeleton data is more robust than RGB images

Objectives
- Combine 3D Pose Estimation and HAR models into an End2End system
- Achieve comparable accuracy scores
- Build a lightweight model feasible for real-life application

Contribution
- Created an End2End model, Pose2Act, that combines 3D Pose Estimation and HAR models
- Proposed a method of overlapping frames to produce HAR-model input of a suitable size
- Outperformed models running on 2D projections of the NTU-RGB+D dataset under the cross-subject metric
- Approached the results of SOTA HAR models running on generated 3D data

Literature (3D Pose)
- Transformer, GCN, LSTM, CNN
- Sequence-to-frame input lengths (27, 81, 243 frames)
- Spatio-temporal modules
- Indirect approach: convert 2D to 3D

Literature (HAR)
- GCN+TCN, 3D CNN, Transformer, LSTM
- 2-, 4-, 5-, and 6-way ensembling
- Spatio-temporal modules
- Attention modules
- Adaptive graph construction

Datasets
For Pose Estimation:
- MPI-INF-3DHP – 1.3 million frames, 8 activity sets, 2D and 3D coordinates and videos, 17 body joints
- Human3.6M – 3.6 million poses, 17 scenarios, 2D and 3D coordinates and videos, 24 body joints
For HAR:
- NTU RGB+D – 60 action classes; 56,880 videos; RGB videos and 3D skeletal data; 25 body joints
- Kinetics – 700 video classes; 650,000 video clips; no skeleton data

Methodology: End2End
- Sequential approach
- Use overlapping sequences of 27 frames
- Predict the 3D coordinates of the central frame of each sequence
- Predict the activity from the predicted 3D skeletons

Methodology: 3D Pose Transformer

Methodology: HAR
- GCN + TCN
- 6-way ensembling: 3 centres of mass; joints and bones
- Hierarchically Decomposed graph

Methodology: HAR (cont.)
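The overlapping-window scheme above can be sketched as follows. The 27-frame window length is from the thesis; the stride and helper names are illustrative, not the actual implementation.

```python
def overlapping_windows(frames, window=27, stride=1):
    """Split a frame sequence into overlapping windows of `window` frames.

    Each window is fed to the sequence-to-frame 3D pose model, which
    predicts the 3D coordinates of the window's central frame only.
    """
    windows = []
    for start in range(0, len(frames) - window + 1, stride):
        windows.append(frames[start:start + window])
    return windows


def central_frame_index(window=27):
    # Index of the central frame whose 3D pose the model outputs.
    return window // 2


# Example: a 100-frame clip yields 74 overlapping 27-frame windows, i.e.
# one 3D pose per frame except for the 13-frame margins at both ends.
frames = list(range(100))  # stand-in for per-frame 2D keypoints
wins = overlapping_windows(frames)
```

With stride 1, consecutive windows share 26 frames, so the HAR model receives a dense per-frame stream of 3D skeletons despite the seq-to-frame pose model.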
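The pose branch is evaluated with normalized MPJPE. As a hedged illustration, the base (un-normalized) MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints; the thesis's exact normalization is not reproduced here.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error between two equal-length lists of
    (x, y, z) joint coordinates (un-normalized base metric)."""
    assert len(pred) == len(gt)
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(pred, gt):
        total += math.sqrt((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2)
    return total / len(pred)


# Toy example with two joints: per-joint errors of 1.0 and 0.0.
err = mpjpe([(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
            [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
```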
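The GCN modules used for skeleton-based HAR propagate joint features over the skeleton graph. A minimal sketch of one such layer (Kipf–Welling-style normalized propagation; the toy joint graph, features, and weights are illustrative, not the thesis's parameters):

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for an unweighted adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]                      # add self-loops
    deg = [sum(row) for row in a_hat]                # node degrees
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_norm, x, w):
    """One propagation step: X' = A_norm @ X @ W (activation omitted)."""
    n, f_in, f_out = len(x), len(x[0]), len(w[0])
    ax = [[sum(a_norm[i][k] * x[k][j] for k in range(n)) for j in range(f_in)]
          for i in range(n)]
    return [[sum(ax[i][k] * w[k][j] for k in range(f_in)) for j in range(f_out)]
            for i in range(n)]


# Toy 3-joint chain (e.g. shoulder-elbow-wrist) with 2-D features per joint
# and an identity weight matrix, so the output is just A_norm @ X.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
a_norm = normalized_adjacency(adj)
out = gcn_layer(a_norm,
                [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Stacking such spatial layers with temporal convolutions (TCN) over the frame axis gives the GCN+TCN pattern used in the HAR branch.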
GCN Module

Results
- Normalized MPJPE scores for the proposed 3D Pose Estimation model are calculated by the formula shown in the slide figure
- Hand activities: reading, writing, playing with the phone, typing on a keyboard
- Average accuracy score: 90.3%
- Normalized MPJPE: 0.044 mm

Demo
- RGB Video → 2D Skeleton → 3D Skeleton → Action

Real-life Application
- Healthcare – combination with sensor-based HAR
- Surveillance – alternative to video-based HAR
- Animation – create a dataset with 3D models to generate virtual content
- Other possible areas: robotics, motion capture, motion prediction, AR and VR

Conclusion
Advantages:
- Outperforms 2D-based and predicted-3D-based models
- Approaches the state-of-the-art results of generated-3D-based models
- Wider range of application than generated 3D
- More robust than video-based approaches
Disadvantages:
- Lower scores than generated-3D models
- Lower scores for hand activities
- A 2D keypoint detector is needed in combination
- Sequence-to-frame bottleneck

Future Work
- Sequence-to-sequence 3D Pose Estimation
- 2D keypoint detection
- 7- and 9-way ensembling
- Parallel End2End processing

Thank you for your attention!

References
[1] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[2] Marius Bock, Michael Moeller, Kristof Van Laerhoven, and Hilde Kuehne. Wear: A multimodal dataset for wearable and egocentric video activity recognition. arXiv preprint arXiv:2304.05088, 2023.
[3] Damien Bouchabou, Sao Mai Nguyen, Christophe Lohr, Benoit LeDuc, and Ioannis Kanellos. A survey of human activity recognition in smart homes based on IoT sensors algorithms: Taxonomies, challenges, and opportunities with deep learning. Sensors, 21(18):6037, 2021.
[4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
[5] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13359–13368, 2021.
[6] Ke Cheng, Yifan Zhang, Congqi Cao, Lei Shi, Jian Cheng, and Hanqing Lu. Decoupling gcn with dropgraph module for skeleton-based action recognition. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pages 536–553. Springer, 2020.
[7] Hyung-gun Chi, Myoung Hoon Ha, Seunggeun Chi, Sang Wan Lee, Qixing Huang, and Karthik Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[9] Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022.
[10] Mariem Gnouma, Ammar Ladjailia, Ridha Ejbali, and Mourad Zaied. Stacked sparse autoencoder and history of binary motion image for human activity recognition. Multimedia Tools and Applications, 78(2):2157–2179, 2019.
[11] Mohammed Hassanin, Abdelwahed Khamiss, Mohammed Bennamoun, Farid Boussaid, and Ibrahim Radwan. Crossformer: Cross spatio-temporal transformer for 3d human pose estimation. arXiv preprint arXiv:2203.13387, 2022.
[12] Yihui He, Rui Yan, Katerina Fragkiadaki, and Shoou-I Yu. Epipolar transformers.
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7779–7788, 2020.
[13] Wenbo Hu, Changgong Zhang, Fangneng Zhan, Lei Zhang, and Tien-Tsin Wong. Conditional directed graph convolution for 3d human pose estimation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 602–611, 2021.
[14] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, July 2014.
[15] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[16] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[17] Inwoong Lee, Doyoung Kim, Seoungyoon Kang, and Sanghoon Lee. Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In Proceedings of the IEEE international conference on computer vision, pages 1012–1020, 2017.
[18] Jaejun Lee, Raphael Tang, and Jimmy Lin. What would elsa do? freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090, 2019.
[19] Jungho Lee, Minhyeok Lee, Dogyoon Lee, and Sangyoon Lee. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:2208.10741, 2022.
[20] Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Transactions on Multimedia, 2022.
[21] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
[22] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 143–152, 2020.
[23] Diogo C Luvizon, David Picard, and Hedi Tabia. Multi-task deep learning for real-time 3d human pose estimation and action recognition. IEEE transactions on pattern analysis and machine intelligence, 43(8):2752–2764, 2020.
[24] Haoyu Ma, Liangjian Chen, Deying Kong, Zhe Wang, Xingwei Liu, Hao Tang, Xiangyi Yan, Yusheng Xie, Shih-Yao Lin, and Xiaohui Xie. Transfusion: Cross-view fusion with transformer for 3d human pose estimation. arXiv preprint arXiv:2110.09554, 2021.
[25] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3D Vision (3DV), 2017 Fifth International Conference on. IEEE, 2017.
[26] Ohoud Nafea, Wadood Abdul, Ghulam Muhammad, and Mansour Alsulaiman. Sensor-based human activity recognition with spatio-temporal deep learning. Sensors, 21(6):2141, 2021.
[27] Mirela Ostrek, Helge Rhodin, Pascal Fua, Erich Müller, and Jörg Spörri. Are existing monocular computer vision-based 3d motion capture approaches ready for deployment? a methodological study on the example of alpine skiing. Sensors, 19(19):4323, 2019.
[28] Sen Qiu, Hongkai Zhao, Nan Jiang, Zhelong Wang, Long Liu, Yi An, Hongyu Zhao, Xin Miao, Ruichen Liu, and Giancarlo Fortino. Multi-sensor information fusion based on machine learning for real applications in human activity recognition: State-of-the-art and research challenges. Information Fusion, 80:241–265, 2022.
[29] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016.
[30] Vijeta Sharma, Manjari Gupta, Anil Kumar Pandey, Deepti Mishra, and Ajai Kumar. A review of deep learning-based human activity recognition on benchmark video datasets. Applied Artificial Intelligence, 36(1):2093705, 2022.
[31] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V, pages 461–478. Springer, 2022.
[32] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12026–12035, 2019.
[33] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017.
[34] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
[35] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5693–5703, 2019.
[36] Arslan Syed, Eman A Aldhahri, Muhammad Munawar Iqbal, Abid Ali, Ammar Muthanna, Harun Jamil, and Faisal Jamil. Intelligent 3d network protocol for multimedia data classification using deep learning. arXiv preprint arXiv:2207.11504, 2022.
[37] Chingizkhan Tangirbergenov, Adnan Yazici, and Enver Ever. Multi-stream orientation and position based adaptive graph convolutional network for skeleton based activity recognition. 2021.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[39] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[40] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pages 466–481, 2018.
[41] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[42] Xinwei Yu. (fusionformer): Exploiting the joint motion synergy with fusion network based on transformer for 3d human pose estimation. arXiv preprint arXiv:2210.04006, 2022.
[43] Ailing Zeng, Xiao Sun, Lei Yang, Nanxuan Zhao, Minhao Liu, and Qiang Xu. Learning skeletal graph neural networks for hard 3d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11436–11445, 2021.
[44] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13232–13242, 2022.
[45] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11656–11665, October 2021.