
Multi-stream Convolutional Human Action Recognition Based on the Fusion of Spatio-Temporal Domain Attention Module

WU Ziyi, CHEN Minrong

Citation: WU Ziyi, CHEN Minrong. Multi-stream Convolutional Human Action Recognition Based on the Fusion of Spatio-Temporal Domain Attention Module[J]. Journal of South China Normal University (Natural Science Edition), 2023, 55(3): 119-128. doi: 10.6054/j.jscnun.2023043


doi: 10.6054/j.jscnun.2023043
Funding: National Natural Science Foundation of China (61872153)

Corresponding author: CHEN Minrong, Email: chenminrong@scnu.edu.cn

  • CLC number: TP391


  • Abstract: To better extract and fuse the temporal and spatial features of the human skeleton, a multi-stream convolutional neural network with a fused spatio-temporal attention module (AE-MCN) is constructed. To address the problem that most existing methods, when modeling the correlations of skeleton sequences, ignore the characteristics of human motion and therefore fail to model the motion scale properly, an adaptive motion-scale selection module is introduced, which adaptively extracts key temporal features from the original-scale motion features. To better model features along both the temporal and spatial dimensions, an attention module fusing the temporal and spatial domains is designed; by assigning weights to high-dimensional spatio-temporal features, it helps the network extract more effective motion information. Finally, comparative experiments on three widely used human action recognition datasets (NTU60, JHMDB and UT-Kinect) verify the effectiveness of AE-MCN. The experimental results show that AE-MCN achieves better recognition accuracy than networks such as ST-GCN and SR-TSL, demonstrating that AE-MCN can effectively extract and model motion information and thereby obtain good action recognition performance.
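    The page describes the two components only at this level of detail, so, purely as an illustration of what the abstract names (adaptive motion-scale selection over skeleton sequences, and attention weighting along the temporal and spatial axes), the following is a minimal PyTorch-style sketch. Every module name, tensor shape, scale set and fusion rule below is an assumption made for exposition, not the authors' implementation.

```python
# Hypothetical sketch of the two components described in the abstract.
# Skeleton features are assumed to be (N, C, T, V) tensors:
# batch, channels, frames, joints. This is NOT the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveMotionScale(nn.Module):
    """Compute frame differences at several temporal strides and fuse them
    with learned weights, so the network can emphasize the motion scale
    that best exposes key temporal features."""

    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.scale_logits = nn.Parameter(torch.zeros(len(scales)))

    def forward(self, x):                            # x: (N, C, T, V)
        w = torch.softmax(self.scale_logits, dim=0)  # weights over scales
        out = 0.0
        for wi, s in zip(w, self.scales):
            d = x[:, :, s:, :] - x[:, :, :-s, :]     # motion at stride s
            d = F.pad(d, (0, 0, s, 0))               # pad time axis back to T
            out = out + wi * d                       # weighted fusion
        return out


class TemporalAttention(nn.Module):
    """TAM: assign a weight to each frame of the feature map."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                            # x: (N, C, T, V)
        a = torch.sigmoid(self.score(x).mean(dim=3, keepdim=True))  # (N,1,T,1)
        return x * a                                 # re-weight frames


class SpatialAttention(nn.Module):
    """SAM: assign a weight to each joint of the feature map."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                            # x: (N, C, T, V)
        a = torch.sigmoid(self.score(x).mean(dim=2, keepdim=True))  # (N,1,1,V)
        return x * a                                 # re-weight joints


# e.g., the best variant in Table 2 applies SAM first, then TAM in series:
# feats = TemporalAttention(64)(SpatialAttention(64)(AdaptiveMotionScale()(x)))
```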
  • Figure 1. The diagram of the channel attention module[28]

    Figure 2. The implementation schematic diagram of the adaptive motion-scale selection module

    Figure 3. The diagram of the attention module integrating the temporal and spatial domains

    Figure 4. The network structure diagram of AE-MCN

    Table 1. The performance of different adaptive motion-scale selection modules on the NTU60 dataset

    Network     Accuracy (CS benchmark)/%   Accuracy (CV benchmark)/%
    baseline            84.5                        91.0
    AE-MCN-A            81.8                        86.8
    AE-MCN-B            83.7                        89.5
    AE-MCN-C            84.9                        91.3

    Table 2. The performance of the temporal and spatial attention modules with different combinations on the NTU60 dataset

    Network                        Accuracy (CS benchmark)/%   Accuracy (CV benchmark)/%
    AE-MCN-C                               84.9                        91.3
    AE-MCN-C+TAM                           85.1                        91.6
    AE-MCN-C+SAM                           85.3                        92.2
    AE-MCN-C+TAM+SAM (Serial)              85.5                        91.7
    AE-MCN-C+TAM+SAM (Parallel)            85.9                        91.8
    AE-MCN-C+SAM+TAM (Serial)              86.3                        92.4
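    "Serial" and "Parallel" in Table 2 refer to how TAM and SAM are wired together. As a hypothetical sketch of the two wirings, reusing the attention modules sketched above (the averaging fusion in the parallel case is an assumption):

```python
import torch.nn as nn


class SerialAttention(nn.Module):
    """Apply one attention module after the other, e.g. SAM then TAM
    (the best-performing row in Table 2)."""

    def __init__(self, first: nn.Module, second: nn.Module):
        super().__init__()
        self.first, self.second = first, second

    def forward(self, x):
        return self.second(self.first(x))


class ParallelAttention(nn.Module):
    """Both modules see the same input; their outputs are fused
    (averaged here, as an illustrative choice)."""

    def __init__(self, branch_a: nn.Module, branch_b: nn.Module):
        super().__init__()
        self.branch_a, self.branch_b = branch_a, branch_b

    def forward(self, x):
        return 0.5 * (self.branch_a(x) + self.branch_b(x))
```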

    Table 3. The performance comparison of different networks on the NTU60 dataset

    Network           Accuracy (CS benchmark)/%   Accuracy (CV benchmark)/%
    VA-LSTM[37]               79.2                        87.7
    ElAtt-GRU[38]             80.7                        88.4
    ST-GCN[12]                81.5                        88.3
    DPRL+GCNN[39]             83.5                        89.8
    SR-TSL[17]                84.8                        92.4
    PR-GCN[13]                85.2                        91.7
    AE-MCN                    86.3                        92.4

    Table 4. The performance comparison of different networks on the JHMDB dataset

    Network              Accuracy/%
    Chained Net[40]         56.8
    EHPI[18]                65.5
    PoTion[41]              67.9
    DD-Net[21]              81.6
    AE-MCN                  83.5

    Table 5. The performance comparison of different networks on the UT-Kinect dataset

    Network                 Accuracy/%
    FusingFeatures[42]         87.9
    ElasticCoding[43]          94.9
    GeoFeat[22]                95.9
    GFT[44]                    96.0
    AE-MCN                     97.9
  • [1] BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]//International Workshop on Human Behavior Understanding. Berlin: Springer, 2011: 29-39.
    [2] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1933-1941.
    [3] SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE Computer Society, 2015: 4597-4605.
    [4] LIU Z, ZHANG C, TIAN Y. 3D-based deep convolutional neural network for action recognition with depth sequences[J]. Image and Vision Computing, 2016, 55: 93-100. doi: 10.1016/j.imavis.2016.04.004
    [5] KIM T S, REITER A. Interpretable 3D human action analysis with temporal convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu: IEEE, 2017: 1623-1631.
    [6] MOON G, CHANG J Y, LEE K M. Posefix: Model-agnostic general human pose refinement network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7773-7781.
    [7] CAO Z, HIDALGO G, SIMON T, et al. OpenPose: realtime multi-person 2D pose estimation using part affinity fields[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(1): 172-186.
    [8] CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7103-7112.
    [9] CAO Z, SIMON T, WEI S E, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7291-7299.
    [10] GREFF K, SRIVASTAVA R K, KOUTNÍK J, et al. LSTM: a search space odyssey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2016, 28(10): 2222-2232.
    [11] LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1012-1020.
    [12] YAN S J, XIONG Y J, LIN D H. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. New Orleans, Louisiana: AAAI Press, 2018: 7444-7452.
    [13] LI S J, YI J H, FARHA Y A, et al. Pose refinement graph convolutional network for skeleton-based action recognition[J]. IEEE Robotics and Automation Letters, 2021, 6(2): 1028-1035. doi: 10.1109/LRA.2021.3056361
    [14] LIU F, QIAO J Z, DAI Q, et al. Skeleton-based action recognition method with two-stream multi-relational GCNs[J]. Journal of Northeastern University (Natural Science), 2021, 42(6): 768-774 (in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-DBDX202106002.htm
    [15] LAN H, HE F, ZHANG P F. Skeleton recognition model based on enhanced graph convolution[J]. Application Research of Computers, 2021, 38(12): 3791-3795, 3825 (in Chinese).
    [16] ZHANG P F, LAN C L, ZENG W J, et al. Semantics-guided neural networks for efficient skeleton-based human action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 1109-1118.
    [17] SI C Y, JING Y, WANG W, et al. Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network[J]. Pattern Recognition, 2020, 107: 107511/1-16. doi: 10.1016/j.patcog.2020.107511
    [18] LUDL D, GULDE T, CURIO C. Simple yet efficient real-time pose-based action recognition[C]//Proceedings of the IEEE Intelligent Transportation Systems Conference. Auckland: IEEE, 2019: 581-588.
    [19] PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7753-7762.
    [20] LI C, ZHONG Q Y, XIE D, et al. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation[C]//Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. Stockholm: AAAI Press, 2018: 786-792.
    [21] YANG F, WU Y, SAKTI S, et al. Make skeleton-based action recognition model smaller, faster and better[C]//Proceedings of the ACM Multimedia Asia. New York: ACM, 2019: 1-6.
    [22] HU J, SHEN L, SUN G, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023. doi: 10.1109/TPAMI.2019.2913372
    [23] HEIDARI N, IOSIFIDIS A. Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition[C]//Proceedings of the 25th International Conference on Pattern Recognition. Milan: IEEE, 2021: 7907-7914.
    [24] FAN Y B, WENG S C, ZHANG Y, et al. Context-aware cross-attention for skeleton-based human action recognition[J]. IEEE Access, 2020, 8: 15280-15290. doi: 10.1109/ACCESS.2020.2968054
    [25] SI C Y, CHEN W T, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 1227-1236.
    [26] ZHANG S, LIU X, XIAO J. On geometric features for skeleton-based action recognition using multilayer LSTM networks[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Santa Rosa: IEEE, 2017: 148-157.
    [27] ZHANG S, YANG Y, XIAO J, et al. Fusing geometric features for skeleton-based action recognition using multilayer LSTM networks[J]. IEEE Transactions on Multimedia, 2018, 20(9): 2330-2343. doi: 10.1109/TMM.2018.2802648
    [28] WANG H, WANG L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 499-508.
    [29] SONG S, LAN C, XING J, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[C]//Proceedings of the AAAI Conference on Artificial Intelligence. San Francisco: AAAI, 2017: 4263-4270.
    [30] HOU J, WANG G, CHEN X, et al. Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition[C]//Proceedings of the European Conference on Computer Vision (ECCV) Workshops. Berlin: Springer, 2018: 273-286.
    [31] SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1010-1019.
    [32] JHUANG H, GALL J, ZUFFI S, et al. Towards understanding action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. Sydney: IEEE, 2013: 3192-3199.
    [33] XIA L, CHEN C C, AGGARWAL J K. View invariant human action recognition using histograms of 3d joints[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence: IEEE, 2012: 20-27.
    [34] PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in Pytorch[J/OL]. NIPS-W 2017 Workshop Autodiff Submission, (2017-10-29)[2022-03-20]. https://openreview.net/forum?id=BJJsrmfCZ&noteId=rkK3fzZJz.
    [35] KINGMA D P, BA J. Adam: a method for stochastic optimization[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego, 2015.
    [36] HE T, ZHANG Z, ZHANG H, et al. Bag of tricks for image classification with convolutional neural networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 558-567.
    [37] ZHANG P F, LAN C L, XING J L, et al. View adaptive recurrent neural networks for high performance human action recognition from skeleton data[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2136-2145.
    [38] ZHANG P F, XUE J R, LAN C L, et al. Adding attentiveness to the neurons in recurrent neural networks[C]//Proceedings of the 15th Computer Vision-ECCV European Conference. Berlin: Springer, 2018: 136-152.
    [39] TANG Y S, TIAN Y, LU J W, et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 5323-5332.
    [40] ZOLFAGHARI M, OLIVEIRA G L, SEDAGHAT N, et al. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2904-2913.
    [41] CHOUTAS V, WEINZAEPFEL P, REVAUD J, et al. Potion: Pose motion representation for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7024-7033.
    [42] ZHU Y, CHEN W, GUO G. Fusing spatiotemporal features and joints for 3D action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland: IEEE, 2013: 486-491.
    [43] ANIRUDH R, TURAGA P, SU J, et al. Elastic functional coding of human actions: from vector-fields to latent variables[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3147-3155.
    [44] KAO J Y, ORTEGA A, TIAN D, et al. Graph based skeleton modeling for human activity analysis[C]//Proceedings of the IEEE International Conference on Image Processing. Taipei: IEEE, 2019: 2025-2029.
Publication history
  • Received: 2021-09-12
  • Available online: 2023-08-26
  • Published in issue: 2023-06-25
