Research Progress on Text Semantic Hashing Technology

孙宇清; 黄钿; 李呈韬; 郑威; 汤庸

doi:10.6054/j.jscnun.2024041

Survey on Text Semantic Hashing Technology

Abstract

Text semantic hashing is a neural encoding technique that converts texts into low-dimensional binary codes under semantic similarity constraints. Because the resulting codes support efficient retrieval based on Hamming distance, the technique addresses the problem of computing similarity over massive text collections with limited computing resources. Text semantic hashing faces several challenges, including how to embed category information into low-dimensional binary codes, how to enrich the semantic information of the codes to improve model robustness, and how to estimate gradients for models with discrete outputs. This article first reviews the major research developments in text semantic hashing and discusses the technical details of unsupervised text semantic hashing models based on text reconstruction and of supervised text semantic hashing models that integrate category information. It also analyzes key techniques such as semantic enhancement based on neighboring texts and latent topics, as well as model optimization. It then summarizes the datasets and evaluation metrics used for the task and compares the characteristics and performance of the different kinds of text semantic hashing methods. Finally, it discusses future research directions.
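To make the retrieval step mentioned in the abstract concrete, the following is a minimal sketch of Hamming distance-based search over binary codes. It is not taken from the article: the function names `hamming_distance` and `hamming_search`, the 8-bit code length, and the example codes are all assumptions chosen for illustration, and a real system would obtain the codes from a trained hashing model.

```python
from typing import List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    # XOR leaves a 1 exactly where the codes disagree.
    return bin(a ^ b).count("1")


def hamming_search(query: int, codes: List[int], k: int) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs for the k stored codes nearest the query."""
    scored = [(i, hamming_distance(query, c)) for i, c in enumerate(codes)]
    # Sort by distance, breaking ties by index so the result is deterministic.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]


# Hypothetical 8-bit codes standing in for the output of a hashing model.
codes = [0b10110010, 0b10110011, 0b01001100, 0b11110000]
query = 0b10110000
top2 = hamming_search(query, codes, k=2)
print(top2)
```

Each comparison here reduces to an XOR and a bit count, which is why binary codes are attractive when computing resources are limited. The linear scan is only the simplest option; it is an illustrative choice rather than a method described in the article.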
