Master’s Thesis – Pose2Act: Transformer-based 3D Pose Estimation and Graph Convolution Networks for Human Activity Recognition
Dias Aimyshev, NU SEDS, 2nd-year CS MS student
Thesis Supervisor: Adnan Yazici

Outline
- Introduction
- Main part: Related Works, Methodology, Results, Demo Application
- Conclusion and Future Work

Introduction
- Pose Estimation – predict the coordinates of human body joints
- Human Activity Recognition (HAR) – classify human activity

Brief History
- Deep Learning: CNN, RNN, LSTM, Autoencoders, TCN, GCN, Transformers, 3D CNN

Problem Statement
- Pose Estimation and HAR are often viewed separately
- Skeleton-based HAR takes the output of 3D Pose Estimation as its input
- Most SOTA HAR models use data generated by motion-capture systems

Motivation
- Motion capture means limited application and hardware dependence
- Low performance on 2D data
- Skeleton data is more robust than RGB images

Objectives
- Combine 3D Pose Estimation and HAR models into an End2End system
- Achieve comparable accuracy scores
- Build a lightweight model feasible for real-life application

Contribution
- Created an End2End model, Pose2Act, that combines 3D Pose Estimation and HAR models
- Proposed a method of overlapping frames to produce HAR-model input of a suitable size
- Outperformed models running on 2D projections of the NTU-RGB+D dataset under the cross-subject metric
- Approached the results of SOTA HAR models running on generated 3D data

Literature (3D Pose)
- Transformer, GCN, LSTM, CNN
- Sequence-to-frame input lengths (27, 81, 243 frames)
- Spatio-temporal modules
- Indirect approach: convert 2D to 3D

Literature (HAR)
- GCN+TCN, 3D CNN, Transformer, LSTM
- 2-, 4-, 5-, and 6-way ensembling
- Spatio-temporal modules
- Attention modules
- Adaptive graph construction

Datasets
For Pose Estimation:
- MPI-INF-3DHP – 1.3 million frames, 8 activity sets, 2D and 3D coordinates and videos, 17 body joints
- Human3.6M – 3.6 million poses, 17 scenarios, 2D and 3D coordinates and videos, 24 body joints
For HAR:
- NTU RGB+D – 60 action classes; 56,880 videos; RGB videos and 3D skeletal data; 25 body joints
- Kinetics – 700 video classes; 650,000 video clips; no skeleton data

Methodology: End2End
- Sequential approach
- Use overlapping sequences of 27 frames
- Predict the 3D coordinates of the central frame of each sequence
- Predict the activity from the predicted 3D skeletons

Methodology: 3D Pose Transformer

Methodology: HAR
- GCN + TCN
- 6-way ensembling: 3 centres of mass; joints and bones
- Hierarchically Decomposed graph

Methodology: HAR (cont.)
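The overlapping-window scheme above can be sketched as follows. The 27-frame window length is from the thesis; the stride and helper names are illustrative, not the actual implementation.

```python
def overlapping_windows(frames, window=27, stride=1):
    """Split a frame sequence into overlapping windows of `window` frames.

    Each window is fed to the sequence-to-frame 3D pose model, which
    predicts the 3D coordinates of the window's central frame only.
    """
    windows = []
    for start in range(0, len(frames) - window + 1, stride):
        windows.append(frames[start:start + window])
    return windows


def central_frame_index(window=27):
    # Index of the central frame whose 3D pose the model outputs.
    return window // 2


# Example: a 100-frame clip yields 74 overlapping 27-frame windows, i.e.
# one 3D pose per frame except for the 13-frame margins at both ends.
frames = list(range(100))  # stand-in for per-frame 2D keypoints
wins = overlapping_windows(frames)
```

With stride 1, consecutive windows share 26 frames, so the HAR model receives a dense per-frame stream of 3D skeletons despite the seq-to-frame pose model.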
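The pose branch is evaluated with normalized MPJPE. As a hedged illustration, the base (un-normalized) MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints; the thesis's exact normalization is not reproduced here.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error between two equal-length lists of
    (x, y, z) joint coordinates (un-normalized base metric)."""
    assert len(pred) == len(gt)
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(pred, gt):
        total += math.sqrt((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2)
    return total / len(pred)


# Toy example with two joints: per-joint errors of 1.0 and 0.0.
err = mpjpe([(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
            [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
```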
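The GCN modules used for skeleton-based HAR propagate joint features over the skeleton graph. A minimal sketch of one such layer (Kipf–Welling-style normalized propagation; the toy joint graph, features, and weights are illustrative, not the thesis's parameters):

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for an unweighted adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]                      # add self-loops
    deg = [sum(row) for row in a_hat]                # node degrees
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_norm, x, w):
    """One propagation step: X' = A_norm @ X @ W (activation omitted)."""
    n, f_in, f_out = len(x), len(x[0]), len(w[0])
    ax = [[sum(a_norm[i][k] * x[k][j] for k in range(n)) for j in range(f_in)]
          for i in range(n)]
    return [[sum(ax[i][k] * w[k][j] for k in range(f_in)) for j in range(f_out)]
            for i in range(n)]


# Toy 3-joint chain (e.g. shoulder-elbow-wrist) with 2-D features per joint
# and an identity weight matrix, so the output is just A_norm @ X.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
a_norm = normalized_adjacency(adj)
out = gcn_layer(a_norm,
                [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Stacking such spatial layers with temporal convolutions (TCN) over the frame axis gives the GCN+TCN pattern used in the HAR branch.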
GCN Module

Results
- Normalized MPJPE scores for the proposed 3D Pose Estimation model are calculated by the formula shown in the slide figure
- Hand activities: reading, writing, playing with the phone, typing on a keyboard
- Average accuracy score: 90.3%
- Normalized MPJPE: 0.044 mm

Demo
- RGB Video → 2D Skeleton → 3D Skeleton → Action

Real-life Application
- Healthcare – combination with sensor-based HAR
- Surveillance – alternative to video-based HAR
- Animation – create a dataset with 3D models to generate virtual content
- Other possible areas: robotics, motion capture, motion prediction, AR and VR

Conclusion
Advantages:
- Outperforms 2D-based and predicted-3D-based models
- Approaches the state-of-the-art results of generated-3D-based models
- Wider range of application than generated 3D
- More robust than video-based approaches
Disadvantages:
- Lower scores than generated-3D models
- Lower scores for hand activities
- A 2D keypoint detector is needed in combination
- Sequence-to-frame bottleneck

Future Work
- Sequence-to-sequence 3D Pose Estimation
- 2D keypoint detection
- 7- and 9-way ensembling
- Parallel End2End processing

Thank you for your attention!

References
[1] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[2] Marius Bock, Michael Moeller, Kristof Van Laerhoven, and Hilde Kuehne. Wear: A multimodal dataset for wearable and egocentric video activity recognition. arXiv preprint arXiv:2304.05088, 2023.
[3] Damien Bouchabou, Sao Mai Nguyen, Christophe Lohr, Benoit LeDuc, and Ioannis Kanellos. A survey of human activity recognition in smart homes based on IoT sensors algorithms: Taxonomies, challenges, and opportunities with deep learning. Sensors, 21(18):6037, 2021.
[4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
[5] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13359–13368, 2021.
[6] Ke Cheng, Yifan Zhang, Congqi Cao, Lei Shi, Jian Cheng, and Hanqing Lu. Decoupling gcn with dropgraph module for skeleton-based action recognition. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pages 536–553. Springer, 2020.
[7] Hyung-gun Chi, Myoung Hoon Ha, Seunggeun Chi, Sang Wan Lee, Qixing Huang, and Karthik Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[9] Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022.
[10] Mariem Gnouma, Ammar Ladjailia, Ridha Ejbali, and Mourad Zaied. Stacked sparse autoencoder and history of binary motion image for human activity recognition. Multimedia Tools and Applications, 78(2):2157–2179, 2019.
[11] Mohammed Hassanin, Abdelwahed Khamiss, Mohammed Bennamoun, Farid Boussaid, and Ibrahim Radwan. Crossformer: Cross spatio-temporal transformer for 3d human pose estimation. arXiv preprint arXiv:2203.13387, 2022.
[12] Yihui He, Rui Yan, Katerina Fragkiadaki, and Shoou-I Yu. Epipolar transformers.
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7779–7788, 2020.
[13] Wenbo Hu, Changgong Zhang, Fangneng Zhan, Lei Zhang, and Tien-Tsin Wong. Conditional directed graph convolution for 3d human pose estimation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 602–611, 2021.
[14] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, July 2014.
[15] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[16] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[17] Inwoong Lee, Doyoung Kim, Seoungyoon Kang, and Sanghoon Lee. Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In Proceedings of the IEEE international conference on computer vision, pages 1012–1020, 2017.
[18] Jaejun Lee, Raphael Tang, and Jimmy Lin. What would elsa do? freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090, 2019.
[19] Jungho Lee, Minhyeok Lee, Dogyoon Lee, and Sangyoon Lee. Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:2208.10741, 2022.
[20] Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Transactions on Multimedia, 2022.
[21] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
[22] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 143–152, 2020.
[23] Diogo C Luvizon, David Picard, and Hedi Tabia. Multi-task deep learning for real-time 3d human pose estimation and action recognition. IEEE transactions on pattern analysis and machine intelligence, 43(8):2752–2764, 2020.
[24] Haoyu Ma, Liangjian Chen, Deying Kong, Zhe Wang, Xingwei Liu, Hao Tang, Xiangyi Yan, Yusheng Xie, Shih-Yao Lin, and Xiaohui Xie. Transfusion: Cross-view fusion with transformer for 3d human pose estimation. arXiv preprint arXiv:2110.09554, 2021.
[25] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3D Vision (3DV), 2017 Fifth International Conference on. IEEE, 2017.
[26] Ohoud Nafea, Wadood Abdul, Ghulam Muhammad, and Mansour Alsulaiman. Sensor-based human activity recognition with spatio-temporal deep learning. Sensors, 21(6):2141, 2021.
[27] Mirela Ostrek, Helge Rhodin, Pascal Fua, Erich Müller, and Jörg Spörri. Are existing monocular computer vision-based 3d motion capture approaches ready for deployment? a methodological study on the example of alpine skiing. Sensors, 19(19):4323, 2019.
[28] Sen Qiu, Hongkai Zhao, Nan Jiang, Zhelong Wang, Long Liu, Yi An, Hongyu Zhao, Xin Miao, Ruichen Liu, and Giancarlo Fortino. Multi-sensor information fusion based on machine learning for real applications in human activity recognition: State-of-the-art and research challenges. Information Fusion, 80:241–265, 2022.
[29] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016.
[30] Vijeta Sharma, Manjari Gupta, Anil Kumar Pandey, Deepti Mishra, and Ajai Kumar. A review of deep learning-based human activity recognition on benchmark video datasets. Applied Artificial Intelligence, 36(1):2093705, 2022.
[31] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V, pages 461–478. Springer, 2022.
[32] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12026–12035, 2019.
[33] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017.
[34] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
[35] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5693–5703, 2019.
[36] Arslan Syed, Eman A Aldhahri, Muhammad Munawar Iqbal, Abid Ali, Ammar Muthanna, Harun Jamil, and Faisal Jamil. Intelligent 3d network protocol for multimedia data classification using deep learning. arXiv preprint arXiv:2207.11504, 2022.
[37] Chingizkhan Tangirbergenov, Adnan Yazici, and Enver Ever. Multi-stream orientation and position based adaptive graph convolutional network for skeleton based activity recognition. 2021.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[39] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[40] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pages 466–481, 2018.
[41] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[42] Xinwei Yu. (fusionformer): Exploiting the joint motion synergy with fusion network based on transformer for 3d human pose estimation. arXiv preprint arXiv:2210.04006, 2022.
[43] Ailing Zeng, Xiao Sun, Lei Yang, Nanxuan Zhao, Minhao Liu, and Qiang Xu. Learning skeletal graph neural networks for hard 3d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11436–11445, 2021.
[44] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13232–13242, 2022.
[45] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11656–11665, October 2021.