WU Ziyi, CHEN Minrong. Multi-stream Convolutional Human Action Recognition Based on the Fusion of Spatio-Temporal Domain Attention Module[J]. Journal of South China Normal University (Natural Science Edition), 2023, 55(3): 119-128. DOI: 10.6054/j.jscnun.2023043

Multi-stream Convolutional Human Action Recognition Based on the Fusion of Spatio-Temporal Domain Attention Module

More Information
  • Received Date: September 11, 2021
  • Available Online: August 25, 2023
  • To better extract and fuse the temporal and spatial features of the human skeleton, this paper constructs a multi-stream convolutional neural network that integrates a spatio-temporal attention module (AE-MCN). Most existing methods ignore human motion characteristics when modeling the correlation of skeleton sequences, so the scale of the action is not properly modeled; to address this, an adaptive motion-scale selection module is introduced that automatically extracts key temporal features from the original-scale action features. To model features better in the temporal and spatial dimensions, an attention module integrating the spatio-temporal domain is designed, which helps the network extract more effective action information by assigning weights to high-dimensional spatio-temporal features. Finally, comparative experiments on three commonly used human action recognition datasets (NTU60, JHMDB and UT-Kinect) verify the effectiveness of the proposed AE-MCN. The results show that AE-MCN achieves better recognition results than ST-GCN, SR-TSL and other networks, demonstrating that it can effectively extract and model action information and thus obtain better action recognition performance.
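To illustrate the core idea of weighting features across time, the sketch below reweights a skeleton feature sequence with softmax attention over frames. This is a minimal, framework-free illustration only: the mean-activation scoring rule and the function names are placeholders, not the learned attention design of AE-MCN itself.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(frames):
    """Reweight a skeleton sequence by per-frame attention.

    frames: list of T feature vectors (lists of floats), one per frame.
    Returns the sequence with each frame scaled by its attention weight.
    The scoring rule here (mean activation per frame) is a placeholder;
    a trained network would learn these scores from data.
    """
    scores = [sum(f) / len(f) for f in frames]   # one score per frame
    weights = softmax(scores)                    # weights sum to 1 over time
    return [[w * v for v in f] for w, f in zip(weights, frames)]

# Toy sequence: 3 frames, 2 features each; the middle frame dominates,
# so attention amplifies it relative to the quieter frames.
seq = [[0.1, 0.2], [0.9, 1.1], [0.3, 0.2]]
out = temporal_attention(seq)
```

A spatial variant would apply the same softmax weighting across joints within a frame instead of across frames; combining both is the spirit of a spatio-temporal attention module.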
  • [1]
    BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]//International Workshop on Human Behavior Understanding. Berlin: Springer, 2011: 29-39.
    [2]
    FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1933-1941.
    [3]
    SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE Computer Society, 2015: 4597-4605.
    [4]
    LIU Z, ZHANG C, TIAN Y. 3D-based deep convolutional neural network for action recognition with depth sequences[J]. Image and Vision Computing, 2016, 55: 93-100. doi: 10.1016/j.imavis.2016.04.004
    [5]
    KIM T S, REITER A. Interpretable 3D human action analysis with temporal convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu: IEEE, 2017: 1623-1631.
    [6]
    MOON G, CHANG J Y, LEE K M. Posefix: Model-agnostic general human pose refinement network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7773-7781.
    [7]
    CAO Z, HIDALGO G, SIMON T, et al. OpenPose: realtime multi-person 2D pose estimation using part affinity fields[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(1): 172-186.
    [8]
    CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7103-7112.
    [9]
    CAO Z, SIMON T, WEI S E, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7291-7299.
    [10]
    GREFF K, SRIVASTAVA R K, KOUTNÍK J, et al. LSTM: a search space odyssey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2016, 28(10): 2222-2232.
    [11]
    LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1012-1020.
    [12]
    YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans, Louisiana: AAAI Press, 2018: 7444-7452.
    [13]
    LI S J, YI J H, FARHA Y A, et al. Pose refinement graph convolutional network for skeleton-based action recognition[J]. IEEE Robotics and Automation Letters, 2021, 6(2): 1028-1035. doi: 10.1109/LRA.2021.3056361
    [14]
    LIU F, QIAO J Z, DAI Q, et al. Skeleton-based action recognition method with two-stream multi-relational GCNs[J]. Journal of Northeastern University (Natural Science), 2021, 42(6): 768-774. https://www.cnki.com.cn/Article/CJFDTOTAL-DBDX202106002.htm
    [15]
    LAN H, HE F, ZHANG P F. Skeleton recognition model based on enhanced graph convolution[J]. Application Research of Computers, 2021, 38(12): 3791-3795, 3825.
    [16]
    ZHANG P F, LAN C L, ZENG W J, et al. Semantics-guided neural networks for efficient skeleton-based human action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 1109-1118.
    [17]
    SI C Y, JING Y, WANG W, et al. Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network[J]. Pattern Recognition, 2020, 107: 107511/1-16. doi: 10.1016/j.patcog.2020.107511
    [18]
    LUDL D, GULDE T, CURIO C. Simple yet efficient real-time pose-based action recognition[C]//Proceedings of the IEEE Intelligent Transportation Systems Conference. Auckland: IEEE, 2019: 581-588.
    [19]
    PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7753-7762.
    [20]
    LI C, ZHONG Q Y, XIE D, et al. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation[C]//Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. Stockholm: AAAI Press, 2018: 786-792.
    [21]
    YANG F, WU Y, SAKTI S, et al. Make skeleton-based action recognition model smaller, faster and better[C]//Proceedings of the ACM Multimedia Asia. New York: ACM, 2019: 1-6.
    [22]
    HU J, SHEN L, SUN G, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023. doi: 10.1109/TPAMI.2019.2913372
    [23]
    HEIDARI N, IOSIFIDIS A. Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition[C]//Proceedings of the 25th International Conference on Pattern Recognition. Milan: IEEE, 2021: 7907-7914.
    [24]
    FAN Y B, WENG S C, ZHANG Y, et al. Context-aware cross-attention for skeleton-based human action recognition[J]. IEEE Access, 2020, 8: 15280-15290. doi: 10.1109/ACCESS.2020.2968054
    [25]
    SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1227-1236.
    [26]
    ZHANG S, LIU X, XIAO J. On geometric features for skeleton-based action recognition using multilayer LSTM networks[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Santa Rosa: IEEE, 2017: 148-157.
    [27]
    ZHANG S, YANG Y, XIAO J, et al. Fusing geometric features for skeleton-based action recognition using multilayer LSTM networks[J]. IEEE Transactions on Multimedia, 2018, 20(9): 2330-2343. doi: 10.1109/TMM.2018.2802648
    [28]
    WANG H, WANG L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 499-508.
    [29]
    SONG S, LAN C, XING J, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[C]//Proceedings of the AAAI Conference on Artificial Intelligence. San Francisco: AAAI, 2017: 4263-4270.
    [30]
    HOU J, WANG G, CHEN X, et al. Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition[C]//Proceedings of the European Conference on Computer Vision (ECCV) Workshops. Berlin: Springer, 2018: 273-286.
    [31]
    SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1010-1019.
    [32]
    JHUANG H, GALL J, ZUFFI S, et al. Towards understanding action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. Sydney: IEEE, 2013: 3192-3199.
    [33]
    XIA L, CHEN C C, AGGARWAL J K. View invariant human action recognition using histograms of 3d joints[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence: IEEE, 2012: 20-27.
    [34]
    PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in Pytorch[J/OL]. NIPS-W 2017 Workshop Autodiff Submission, (2017-10-29)[2022-03-20]. https://openreview.net/forum?id=BJJsrmfCZ&noteId=rkK3fzZJz.
    [35]
    KINGMA D P, BA J. Adam: a method for stochastic optimization[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego: ICLR, 2015.
    [36]
    HE T, ZHANG Z, ZHANG H, et al. Bag of tricks for image classification with convolutional neural networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 558-567.
    [37]
    ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2136-2145.
    [38]
    ZHANG P F, XUE J R, LAN C L, et al. Adding attentiveness to the neurons in recurrent neural networks[C]//Proceedings of the 15th Computer Vision-ECCV European Conference. Berlin: Springer, 2018: 136-152.
    [39]
    TANG Y S, TIAN Y, LU J W, et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 5323-5332.
    [40]
    ZOLFAGHARI M, OLIVEIRA G L, SEDAGHAT N, et al. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2904-2913.
    [41]
    CHOUTAS V, WEINZAEPFEL P, REVAUD J, et al. Potion: Pose motion representation for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7024-7033.
    [42]
    ZHU Y, CHEN W, GUO G. Fusing spatiotemporal features and joints for 3D action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland: IEEE, 2013: 486-491.
    [43]
    ANIRUDH R, TURAGA P, SU J, et al. Elastic functional coding of human actions: from vector-fields to latent variables[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3147-3155.
    [44]
    KAO J Y, ORTEGA A, TIAN D, et al. Graph based skeleton modeling for human activity analysis[C]//Proceedings of the IEEE International Conference on Image Processing. Taipei: IEEE, 2019: 2025-2029.
  • Cited by

    Periodical cited type(3)

    1. CHEN Wei, GE Shishun. Intelligent recognition method for difficult movements in competitive Wushu routines. Journal of Xinxiang University. 2024(06): 72-76.
    2. XU Jing. Recognition and performance analysis of cheerleading movement styles based on a perceptual learning algorithm. Journal of Jingdezhen University. 2024(03): 48-52.
    3. LIAO Minling. Research on multi-view human action image recognition based on salient features. Modern Electronics Technique. 2024(24): 143-147.

    Other cited types(1)
