3D Skeleton-based Human Action Recognition Based on Multi-stream Fusion Network

CHEN Minrong, PENG Junjie, ZENG Guoqiang

Citation: CHEN Minrong, PENG Junjie, ZENG Guoqiang. 3D Skeleton-based Human Action Recognition Based on Multi-stream Fusion Network[J]. Journal of South China Normal University (Natural Science Edition), 2023, 55(1): 94-101. DOI: 10.6054/j.jscnun.2023009


Funding: National Natural Science Foundation of China (61872153, 61972288)

Corresponding author: CHEN Minrong, Email: chenminrong@scnu.edu.cn

  • CLC number: TP391

3D Skeleton-based Human Action Recognition Based on Multi-stream Fusion Network

  • Abstract: Most current 3D skeleton-based human action recognition models built on convolutional neural networks do not fully exploit the geometric features embedded in skeleton sequences. To address this shortcoming, this paper improves on the AIF-CNN model and proposes a multi-stream fusion network model (MS-CNN). The model takes a new geometric feature (the kernel feature) as an additional input, which enriches the original features, and adds multi-motion features, which allow the model to learn more robust global motion information. Ablation experiments are conducted on the NTU RGB+D 60 dataset, and MS-CNN is compared with 19 action recognition models on NTU RGB+D 60 and with 8 models on NTU RGB+D 120. The ablation results show that fusing joint features with kernel features yields higher recognition accuracy than fusing joint features with core features, and that recognition accuracy improves as more motion features are added. The comparison results show that, under both evaluation strategies, MS-CNN achieves higher recognition accuracy than most of the compared models, including the baseline AIF-CNN model.
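    As a rough illustration of what a multi-motion input could look like, the sketch below computes differences of joint coordinates between frames that are a fixed number of steps apart. This is an assumption made for illustration only: the abstract does not define motion_1, motion_2, or motion_3, and both the function name `motion_features` and the reading of "motion_k" as a stride-k frame difference are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # one frame: J joints, each given as [x, y, z]


def motion_features(sequence: List[Frame], stride: int) -> List[Frame]:
    """Return per-joint coordinate differences between frames `stride` apart.

    Hypothetical reading of "motion_k": motion_k[t] = sequence[t + k] - sequence[t].
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return [
        [
            [later - now for now, later in zip(joint_now, joint_later)]
            for joint_now, joint_later in zip(sequence[t], sequence[t + stride])
        ]
        for t in range(len(sequence) - stride)
    ]


# Toy sequence: 4 frames, 1 joint moving along x with increasing speed.
toy = [[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [[3.0, 0.0, 0.0]], [[6.0, 0.0, 0.0]]]
motion_1 = motion_features(toy, 1)  # differences between adjacent frames
motion_2 = motion_features(toy, 2)  # differences between frames two steps apart
```

    Under this reading, a larger stride captures coarser, longer-range displacement, which is one plausible way several motion streams could complement each other.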
  • Figure 1. 3D human skeleton in the NTU RGB+D 60 dataset

    Figure 2. Overall structure of the MS-CNN model

    Figure 3. Core network structure of the MS-CNN model

    Table 1   Recognition accuracy (%) of the kernel feature on the NTU RGB+D 60 dataset

    No.  Model                                             Cross-View  Cross-Subject
    1    AIF-CNN(joint+core,motion_1)                      94.7        88.0
    2    AIF-CNN(joint+core,motion_1+motion_2)             95.0        88.2
    3    AIF-CNN(joint+core,motion_1+motion_2+motion_3)    95.2        88.9
    4    MS-CNN(joint+kernel,motion_1)                     95.0        88.1
    5    MS-CNN(joint+kernel,motion_1+motion_2)            95.2        88.4
    6    MS-CNN(joint+kernel,motion_1+motion_2+motion_3)   95.4        89.1
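
    To make the size of the kernel-feature effect explicit, the short sketch below subtracts the joint+core rows of Table 1 from the matching joint+kernel rows. The accuracy values are copied from the table; the dictionary layout and the helper name `gains` are illustrative choices, not part of the paper.

```python
# Accuracy (%) from Table 1 as (Cross-View, Cross-Subject), keyed by motion setting.
joint_core = {  # rows 1-3, labeled AIF-CNN in the table
    "motion_1": (94.7, 88.0),
    "motion_1+motion_2": (95.0, 88.2),
    "motion_1+motion_2+motion_3": (95.2, 88.9),
}
joint_kernel = {  # rows 4-6, labeled MS-CNN in the table
    "motion_1": (95.0, 88.1),
    "motion_1+motion_2": (95.2, 88.4),
    "motion_1+motion_2+motion_3": (95.4, 89.1),
}


def gains(baseline, improved):
    """Per-setting accuracy difference in percentage points, rounded to 0.1."""
    return {
        setting: tuple(round(new - old, 1) for old, new in zip(baseline[setting], improved[setting]))
        for setting in baseline
    }


kernel_gain = gains(joint_core, joint_kernel)
for setting, (cross_view, cross_subject) in kernel_gain.items():
    print(f"{setting}: +{cross_view} (Cross-View), +{cross_subject} (Cross-Subject)")
```

    In every motion setting the joint+kernel variant is ahead by 0.1 to 0.3 percentage points, which is the pattern the ablation claim in the abstract refers to.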

    Table 2   Recognition accuracy (%) of the multi-motion feature on the NTU RGB+D 60 dataset

    No.  Model                                             Cross-View  Cross-Subject
    1    AIF-CNN(merge,motion_1)                           95.6        89.0
    2    AIF-CNN(merge,motion_1+motion_2)                  95.8        89.8
    3    AIF-CNN(merge,motion_1+motion_2+motion_3)         96.0        90.3
    4    MS-CNN(merge,motion_1)                            95.6        89.0
    5    MS-CNN(merge,motion_1+motion_2)                   95.8        89.8
    6    MS-CNN(merge,motion_1+motion_2+motion_3)          96.1        90.4

    Table 3   Recognition accuracy (%) of 20 models on the NTU RGB+D 60 dataset

    No.  Model                   Cross-View  Cross-Subject
    1    Deep LSTM[5]            67.3        60.7
    2    Two-Stream RNN[14]      79.5        71.3
    3    STA-LSTM[15]            81.2        73.4
    4    Ensemble TS-LSTM[16]    81.3        74.6
    5    VA-LSTM[6]              87.6        79.4
    6    BGC-LSTM[17]            89.0        81.8
    7    ST-GCN[10]              86.3        74.9
    8    AS-GCN[11]              94.2        86.8
    9    2S-AGCN[18]             95.1        88.5
    10   DGNN[19]                96.1        89.9
    11   GCN-NAS[20]             95.7        89.4
    12   CGCN[21]                96.4        90.3
    13   DC-GCN+ADG[22]          96.6        90.8
    14   Res-TCN[23]             83.1        74.3
    15   Clips+CNN+MTLN[24]      84.8        79.6
    16   HCN[8]                  91.1        86.5
    17   SR-TSL[25]              92.4        84.8
    18   VA-CNN[6]               94.3        88.7
    19   AIF-CNN[9]              95.6        89.9
    20   MS-CNN                  96.1        90.4

    Table 4   Recognition accuracy (%) of 9 models on the NTU RGB+D 120 dataset

    No.  Model                                 Cross-View  Cross-Subject
    1    Spatio-Temporal LSTM[26]              57.9        55.7
    2    GCA-LSTM[27]                          59.2        58.3
    3    Multi-Task CNN with RotClips[28]      61.8        62.2
    4    Logsig-RNN[29]                        67.2        68.3
    5    Gimme Signals[30]                     71.6        70.8
    6    GVFE+AS-GCN with DH-TCN[31]           79.8        78.3
    7    SkeleMotion[32]                       67.7        66.9
    8    TSRJI[33]                             67.9        62.8
    9    MS-CNN                                80.3        80.0
  • [1] QIAN H F, YI J P, FU Y H. Review of human action recognition based on deep learning[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(3): 438-455. https://www.cnki.com.cn/Article/CJFDTOTAL-KXTS202103004.htm

    [2] NIU Y Q, SU W J, YU C C, et al. Real-time action recognition for intelligent surveillance based on the TX2 environment[J]. Information Technology and Informatization, 2021(4): 243-245. doi: 10.3969/j.issn.1672-9528.2021.04.079

    [3] ZHANG Q B, DING N N, WU H B. Fall recognition method based on BP neural network[J]. Command Information System and Technology, 2021, 12(1): 60-64. https://www.cnki.com.cn/Article/CJFDTOTAL-ZHXT202101011.htm

    [4] JOHANSSON G. Visual perception of biological motion and a model for its analysis[J]. Perception & Psychophysics, 1973, 14(2): 201-211.

    [5] SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 1010-1019.

    [6] ZHANG P, LAN C, XING J, et al. View adaptive neural networks for high performance skeleton-based human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1963-1978. doi: 10.1109/TPAMI.2019.2896631

    [7] DU Y, FU Y, WANG L. Skeleton based action recognition with convolutional neural network[C]//Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition. Kuala Lumpur: IEEE, 2015: 579-583.

    [8] LI C, ZHONG Q, XIE D, et al. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm: IJCAI, 2018: 786-792.

    [9] SU H, CHANG Z, YU M, et al. Convolutional neural network with adaptive inferential framework for skeleton-based action recognition[J]. Journal of Visual Communication and Image Representation, 2020, 73: 102925/1-8.

    [10] YAN S, XIONG Y, LIN D. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI, 2018: 7444-7452.

    [11] LI M, CHEN S, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 3595-3603.

    [12] CHENG K, ZHANG Y, HE X, et al. Skeleton-based action recognition with shift graph convolutional network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 183-192.

    [13] LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 42(10): 2684-2701.

    [14] WANG H, WANG L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 499-508.

    [15] SONG S, LAN C, XING J, et al. An end-to-end spatio-temporal attention model for human action recognition from skeleton data[C]//Proceedings of the AAAI Conference on Artificial Intelligence. San Francisco: AAAI, 2017: 4263-4270.

    [16] LEE I, KIM D, KANG S, et al. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 1012-1020.

    [17] ZHAO R, WANG K, SU H, et al. Bayesian graph convolution LSTM for skeleton based action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. Seoul: IEEE, 2019: 6882-6892.

    [18] SHI L, ZHANG Y, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 12026-12035.

    [19] SHI L, ZHANG Y, CHENG J, et al. Skeleton-based action recognition with directed graph neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7912-7921.

    [20] PENG W, HONG X, CHEN H, et al. Learning graph convolutional network for skeleton-based human action recognition by neural searching[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020: 2669-2676.

    [21] YANG D, LI M M, FU H, et al. Centrality graph convolutional networks for skeleton-based action recognition[J/OL]. (2020-03-06)[2020-10-15]. arXiv. http://doi.org/10.48550/arXiv.2003.03007.

    [22] CHENG K, ZHANG Y, CAO C, et al. Decoupling GCN with dropgraph module for skeleton-based action recognition[C]//Proceedings of the European Conference on Computer Vision. Glasgow: Springer, 2020: 536-553.

    [23] KIM T S, REITER A. Interpretable 3D human action analysis with temporal convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu: IEEE, 2017: 1623-1631.

    [24] KE Q, BENNAMOUN M, AN S, et al. A new representation of skeleton sequences for 3D action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 3288-3297.

    [25] SI C, JING Y, WANG W, et al. Skeleton-based action recognition with spatial reasoning and temporal stack learning[C]//Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 103-118.

    [26] LIU J, SHAHROUDY A, XU D, et al. Spatio-temporal LSTM with trust gates for 3D human action recognition[C]//Proceedings of the European Conference on Computer Vision. Amsterdam: Springer, 2016: 816-833.

    [27] LIU J, WANG G, HU P, et al. Global context-aware attention LSTM networks for 3D action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1647-1656.

    [28] KE Q, BENNAMOUN M, AN S, et al. Learning clip representations for skeleton-based 3D action recognition[J]. IEEE Transactions on Image Processing, 2018, 27(6): 2842-2855.

    [29] LIAO S, LYONS T, YANG W, et al. Learning stochastic differential equations using RNN with log signature features[J/OL]. (2019-08-22)[2021-10-15]. arXiv. https://doi.org/10.48550/arXiv.1908.08286.

    [30] MEMMESHEIMER R, THEISEN N, PAULUS D. Gimme signals: discriminative signal encoding for multimodal activity recognition[C]//Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS). Las Vegas: IEEE, 2020: 10394-10401.

    [31] PAPADOPOULOS K, GHORBEL E, AOUAD D, et al. Vertex feature encoding and hierarchical temporal modeling in a spatio-temporal graph convolutional network for action recognition[C]//Proceedings of the 25th International Conference on Pattern Recognition. Milan: IEEE, 2021: 452-458.

    [32] CAETANO C, SENA J, BREMOND F, et al. SkeleMotion: a new representation of skeleton joint sequences based on motion information for 3D action recognition[C]//Proceedings of the International Conference on Advanced Video and Signal-based Surveillance (AVSS). Taipei, China: IEEE, 2019: 1-8.

    [33] CAETANO C, BREMOND F, SCHWARTZ W R. Skeleton image representation for 3D action recognition based on tree structure and reference joints[C]//Proceedings of the 32nd SIBGRAPI Conference on Graphics, Patterns and Images. Rio de Janeiro: IEEE, 2019: 16-23.

Publication history
  • Received: 2021-10-14
  • Published online: 2023-04-11
  • Issue date: 2023-02-24
