WU Ziyi, CHEN Minrong. Multi-stream Convolutional Human Action Recognition Based on the Fusion of Spatio-Temporal Domain Attention Module[J]. Journal of South China Normal University (Natural Science Edition), 2023, 55(3): 119-128. DOI: 10.6054/j.jscnun.2023043

Multi-stream Convolutional Human Action Recognition Based on the Fusion of Spatio-Temporal Domain Attention Module

More Information
  • Received Date: September 11, 2021
  • Available Online: August 25, 2023
  • To better extract and fuse the temporal and spatial features of the human skeleton, this paper constructs a multi-stream convolutional neural network that integrates a spatio-temporal attention module (AE-MCN). Most existing methods ignore human motion characteristics when modeling the correlation of skeleton sequences, so the scale of the action is not properly modeled; to address this, an adaptive motion-scale selection module is introduced that automatically extracts key temporal features from the original-scale action features. To model features better in the temporal and spatial dimensions, an attention module integrating the spatio-temporal domain is designed, which helps the network extract more effective action information by assigning weights to high-dimensional spatio-temporal features. Finally, comparative experiments on three commonly used human action recognition datasets (NTU60, JHMDB and UT-Kinect) verify the effectiveness of the proposed AE-MCN. The results show that AE-MCN achieves better recognition results than ST-GCN, SR-TSL and other networks, demonstrating that it can effectively extract and model action information and thus obtain better action recognition performance.
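To illustrate the core idea of weighting features across time, the sketch below reweights a skeleton feature sequence with softmax attention over frames. This is a minimal, framework-free illustration only: the mean-activation scoring rule and the function names are placeholders, not the learned attention design of AE-MCN itself.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(frames):
    """Reweight a skeleton sequence by per-frame attention.

    frames: list of T feature vectors (lists of floats), one per frame.
    Returns the sequence with each frame scaled by its attention weight.
    The scoring rule here (mean activation per frame) is a placeholder;
    a trained network would learn these scores from data.
    """
    scores = [sum(f) / len(f) for f in frames]   # one score per frame
    weights = softmax(scores)                    # weights sum to 1 over time
    return [[w * v for v in f] for w, f in zip(weights, frames)]

# Toy sequence: 3 frames, 2 features each; the middle frame dominates,
# so attention amplifies it relative to the quieter frames.
seq = [[0.1, 0.2], [0.9, 1.1], [0.3, 0.2]]
out = temporal_attention(seq)
```

A spatial variant would apply the same softmax weighting across joints within a frame instead of across frames; combining both is the spirit of a spatio-temporal attention module.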
  • [1]
    BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]//International Workshop on Human Behavior Understanding. Berlin: Springer, 2011: 29-39.
    [2]
    FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1933-1941.
    [3]
    SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE Computer Society, 2015: 4597-4605.
    [4]
    LIU Z, ZHANG C, TIAN Y. 3D-based deep convolutional neural network for action recognition with depth sequences[J]. Image and Vision Computing, 2016, 55: 93-100. doi: 10.1016/j.imavis.2016.04.004
    [5]
    KIM T S, REITER A. Interpretable 3D human action analysis with temporal convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu: IEEE, 2017: 1623-1631.
    [6]
    MOON G, CHANG J Y, LEE K M. Posefix: Model-agnostic general human pose refinement network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7773-7781.
    [7]
    CAO Z, HIDALGO G, SIMON T, et al. OpenPose: realtime multi-person 2D pose estimation using part affinity fields[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(1): 172-186.
    [8]
    CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7103-7112.
    [9]
    CAO Z, SIMON T, WEI S E, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7291-7299.
    [10]
    GREFF K, SRIVASTAVA R K, KOUTNÍK J, et al. LSTM: a search space odyssey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2016, 28(10): 2222-2232.
    [11]
    LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1012-1020.
    [12]
    YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans, Louisiana: AAAI Press, 2018: 7444-7452.
    [13]
    LI S J, YI J H, FARHA Y A, et al. Pose refinement graph convolutional network for skeleton-based action recognition[J]. IEEE Robotics and Automation Letters, 2021, 6(2): 1028-1035. doi: 10.1109/LRA.2021.3056361
    [14]
    LIU F, QIAO J Z, DAI Q, et al. Skeleton-based action recognition method with two-stream multi-relational GCNs[J]. Journal of Northeastern University (Natural Science), 2021, 42(6): 768-774. https://www.cnki.com.cn/Article/CJFDTOTAL-DBDX202106002.htm
    [15]
    LAN H, HE F, ZHANG P F. Skeleton recognition model based on enhanced graph convolution[J]. Application Research of Computers, 2021, 38(12): 3791-3795, 3825.
    [16]
    ZHANG P F, LAN C L, ZENG W J, et al. Semantics-guided neural networks for efficient skeleton-based human action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 1109-1118.
    [17]
    SI C Y, JING Y, WANG W, et al. Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network[J]. Pattern Recognition, 2020, 107: 107511/1-16. doi: 10.1016/j.patcog.2020.107511
    [18]
    LUDL D, GULDE T, CURIO C. Simple yet efficient real-time pose-based action recognition[C]//Proceedings of the IEEE Intelligent Transportation Systems Conference. Auckland: IEEE, 2019: 581-588.
    [19]
    PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7753-7762.
    [20]
    LI C, ZHONG Q Y, XIE D, et al. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation[C]//Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. Stockholm: AAAI Press, 2018: 786-792.
    [21]
    YANG F, WU Y, SAKTI S, et al. Make skeleton-based action recognition model smaller, faster and better[C]//Proceedings of the ACM Multimedia Asia. New York: ACM, 2019: 1-6.
    [22]
    HU J, SHEN L, SUN G, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023. doi: 10.1109/TPAMI.2019.2913372
    [23]
    HEIDARI N, IOSIFIDIS A. Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition[C]//Proceedings of the 25th International Conference on Pattern Recognition. Milan: IEEE, 2021: 7907-7914.
    [24]
    FAN Y B, WENG S C, ZHANG Y, et al. Context-aware cross-attention for skeleton-based human action recognition[J]. IEEE Access, 2020, 8: 15280-15290. doi: 10.1109/ACCESS.2020.2968054
    [25]
    SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1227-1236.
    [26]
    ZHANG S, LIU X, XIAO J. On geometric features for skeleton-based action recognition using multilayer LSTM networks[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Santa Rosa: IEEE, 2017: 148-157.
    [27]
    ZHANG S, YANG Y, XIAO J, et al. Fusing geometric features for skeleton-based action recognition using multilayer LSTM networks[J]. IEEE Transactions on Multimedia, 2018, 20(9): 2330-2343. doi: 10.1109/TMM.2018.2802648
    [28]
    WANG H, WANG L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 499-508.
    [29]
    SONG S, LAN C, XING J, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[C]//Proceedings of the AAAI Conference on Artificial Intelligence. San Francisco: AAAI, 2017: 4263-4270.
    [30]
    HOU J, WANG G, CHEN X, et al. Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition[C]//Proceedings of the European Conference on Computer Vision (ECCV) Workshops. Berlin: Springer, 2018: 273-286.
    [31]
    SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1010-1019.
    [32]
    JHUANG H, GALL J, ZUFFI S, et al. Towards understanding action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. Sydney: IEEE, 2013: 3192-3199.
    [33]
    XIA L, CHEN C C, AGGARWAL J K. View invariant human action recognition using histograms of 3d joints[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence: IEEE, 2012: 20-27.
    [34]
    PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in Pytorch[J/OL]. NIPS-W 2017 Workshop Autodiff Submission, (2017-10-29)[2022-03-20]. https://openreview.net/forum?id=BJJsrmfCZ&noteId=rkK3fzZJz.
    [35]
    KINGMA D P, BA J. Adam: a method for stochastic optimization[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego: ICLR, 2015.
    [36]
    HE T, ZHANG Z, ZHANG H, et al. Bag of tricks for image classification with convolutional neural networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 558-567.
    [37]
    ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2136-2145.
    [38]
    ZHANG P F, XUE J R, LAN C L, et al. Adding attentiveness to the neurons in recurrent neural networks[C]//Proceedings of the 15th Computer Vision-ECCV European Conference. Berlin: Springer, 2018: 136-152.
    [39]
    TANG Y S, TIAN Y, LU J W, et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 5323-5332.
    [40]
    ZOLFAGHARI M, OLIVEIRA G L, SEDAGHAT N, et al. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2904-2913.
    [41]
    CHOUTAS V, WEINZAEPFEL P, REVAUD J, et al. Potion: Pose motion representation for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7024-7033.
    [42]
    ZHU Y, CHEN W, GUO G. Fusing spatiotemporal features and joints for 3D action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland: IEEE, 2013: 486-491.
    [43]
    ANIRUDH R, TURAGA P, SU J, et al. Elastic functional coding of human actions: from vector-fields to latent variables[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3147-3155.
    [44]
    KAO J Y, ORTEGA A, TIAN D, et al. Graph based skeleton modeling for human activity analysis[C]//Proceedings of the IEEE International Conference on Image Processing. Taipei: IEEE, 2019: 2025-2029.
  • Cited by

    Periodical cited type(3)

    1. CHEN Wei, GE Shishun. Intelligent recognition method for difficult movements in competitive Wushu routines. Journal of Xinxiang University. 2024(06): 72-76.
    2. XU Jing. Recognition and performance analysis of cheerleading movement styles based on a perceptual learning algorithm. Journal of Jingdezhen University. 2024(03): 48-52.
    3. LIAO Minling. Research on multi-view human action image recognition based on salient features. Modern Electronics Technique. 2024(24): 143-147.

    Other cited types(1)
