The Cross-modal Retrieval Based on Batch Loss

LIU Shuang, QIAO Han, XU Qingzhen

Citation: LIU Shuang, QIAO Han, XU Qingzhen. The Cross-modal Retrieval Based on Batch Loss[J]. Journal of South China Normal University (Natural Science Edition), 2021, 53(6): 115-121. doi: 10.6054/j.jscnun.2021101

doi: 10.6054/j.jscnun.2021101

Funding: Science and Technology Planning Project of Guangdong Province (201903010103)

Corresponding author: XU Qingzhen, Email: 20061040@m.scnu.edu.cn

  • CLC number: TP391

  • Abstract: To address the problem that pair- or triplet-based cross-modal retrieval methods construct highly redundant and uninformative sample pairs, a cross-modal retrieval method based on batch loss (BLCMR) is proposed. First, a batch loss is introduced that takes the similarity of the embedded samples into account and effectively preserves the invariance of cross-modal samples. Then, an iterative method is introduced to correct the predicted class labels, effectively distinguishing the semantic category information of the samples. Experiments on three public datasets (Wikipedia, Pascal Sentence and NUS-WIDE-10k) show that BLCMR can reduce the distance between cross-modal samples and effectively improve the final cross-modal retrieval accuracy.
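
    The authors' implementation is not reproduced on this page, but the two ingredients named in the abstract can be sketched. The following is a minimal, hypothetical PyTorch version, assuming the batch loss follows the group-loss recipe (Elezi et al., ECCV 2020) that such batch-level methods build on: pairwise similarities between the embedded samples of a mini-batch iteratively refine the predicted class probabilities, and the refined predictions are penalized against the ground-truth labels. The function name, the ReLU-thresholded similarity and the exact update rule are assumptions, not the paper's definition.

    import torch
    import torch.nn.functional as F

    def batch_loss(embeddings, logits, labels, num_iters=3):
        """Hypothetical group-loss-style batch loss (a sketch, not BLCMR itself).

        embeddings: (B, d) common-space features of one mini-batch (both modalities);
        logits:     (B, C) class predictions for the same samples;
        labels:     (B,)   ground-truth semantic categories.
        """
        # Pairwise similarity between the embedded samples of the batch
        z = F.normalize(embeddings, dim=1)
        sim = F.relu(z @ z.t())
        eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        sim = sim.masked_fill(eye, 0.0)  # a sample casts no vote for itself

        # Iteratively correct the predicted class labels: each prediction is
        # reinforced by the predictions of similar samples (a replicator-
        # dynamics update in the spirit of relaxation labeling).
        probs = F.softmax(logits, dim=1)
        for _ in range(num_iters):
            support = sim @ probs                              # neighbours' votes
            probs = probs * support
            probs = probs / probs.sum(dim=1, keepdim=True).clamp_min(1e-12)

        # Penalize the refined predictions against the true categories
        return F.nll_loss(probs.clamp_min(1e-12).log(), labels)

    In the full method this term would be combined with the other loss components ablated in Table 3 below.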
  • Figure 1. The general architecture of the BLCMR method

    Figure 2. The curve of the total loss value on the Wikipedia dataset

    Figure 3. The visualization of the image and text samples on the Wikipedia dataset

    Note: samples of the same color share the same semantic category.

    Table 1. The partitioning of the datasets

    Dataset          Ntrain  Ntest
    Wikipedia         2 173    462
    Pascal Sentence     800    100
    NUS-WIDE-10k      8 000  1 000

    Note: Ntrain and Ntest are the numbers of training and test instances, respectively.

    Table 2. The performance of cross-modal retrieval

    Method      Wikipedia                Pascal Sentence          NUS-WIDE-10k
                Img2Txt  Txt2Img  Avg    Img2Txt  Txt2Img  Avg    Img2Txt  Txt2Img  Avg
    CCA         0.134    0.133    0.134  0.225    0.227    0.226  0.378    0.394    0.386
    MCCA        0.341    0.307    0.324  0.664    0.689    0.677  0.448    0.462    0.456
    MvDA        0.337    0.308    0.323  0.594    0.626    0.610  0.501    0.526    0.513
    MvDA-VC     0.388    0.358    0.373  0.648    0.673    0.661  0.526    0.557    0.542
    JRL         0.449    0.418    0.434  0.527    0.534    0.531  0.586    0.598    0.592
    DCCA        0.444    0.396    0.420  0.678    0.677    0.677  0.532    0.549    0.540
    DCCAE       0.435    0.385    0.410  0.680    0.671    0.675  0.511    0.540    0.525
    CMDN        0.487    0.427    0.457  0.544    0.526    0.535  0.492    0.515    0.504
    CCL         0.504    0.457    0.481  0.576    0.561    0.569  0.506    0.535    0.521
    BDTR        0.492    0.465    0.478  0.648    0.670    0.659  0.570    0.586    0.578
    ACMR        0.460    0.450    0.455  0.658    0.664    0.661  0.590    0.595    0.592
    GSS-SL      0.504    0.461    0.483  0.624    0.623    0.623  0.542    0.557    0.550
    CM-GANs     0.521    0.466    0.494  0.603    0.604    0.604  –        –        –
    BLCMR       0.507    0.479    0.493  0.687    0.691    0.689  0.582    0.606    0.594

    Note: "–" marks values not given in the original table.
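
    The scores in Table 2 (and Table 3 below) are retrieval scores on the standard benchmarks. The usual protocol for these datasets (an assumption here, since the metric is defined in the body text, which this page does not reproduce) ranks the gallery of one modality by cosine similarity to each query from the other modality and reports mean average precision (mAP). A minimal NumPy sketch:

    import numpy as np

    def mean_average_precision(queries, gallery, q_labels, g_labels):
        """mAP under cosine-similarity ranking (hypothetical evaluation sketch)."""
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
        order = np.argsort(-(q @ g.T), axis=1)           # best match first
        aps = []
        for i, ranking in enumerate(order):
            relevant = g_labels[ranking] == q_labels[i]  # same category = relevant
            if not relevant.any():
                continue
            precision = np.cumsum(relevant) / (np.arange(relevant.size) + 1)
            aps.append(precision[relevant].mean())       # average precision
        return float(np.mean(aps))

    # Img2Txt scores images as queries against the text gallery; Txt2Img is the reverse.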

    Table 3. The retrieval performance of the BLCMR method and its loss terms L1, L2 and L3

    Method   Wikipedia                Pascal Sentence          NUS-WIDE-10k
             Img2Txt  Txt2Img  Avg    Img2Txt  Txt2Img  Avg    Img2Txt  Txt2Img  Avg
    L1       0.484    0.464    0.474  0.661    0.661    0.661  0.575    0.593    0.584
    L2       0.184    0.154    0.169  0.147    0.139    0.143  0.137    0.138    0.138
    L3       0.281    0.162    0.222  0.493    0.273    0.383  0.331    0.221    0.276
    BLCMR    0.507    0.479    0.493  0.687    0.691    0.689  0.582    0.606    0.594
Publication history
  • Received: 2021-05-02
  • Published online: 2022-01-10
  • Issue date: 2021-12-25
