Abstract:
Text semantic hashing refers to neural techniques that encode texts into low-dimensional binary codes under semantic similarity constraints. Because hash codes support retrieval based on Hamming distance, text similarity can be computed efficiently over massive data. Text semantic hashing faces several challenges, such as how to embed category information into low-dimensional binary codes, how to enrich semantic information to improve model robustness, and how to optimize a model over the discrete coding space. This paper first reviews the important advances in text semantic hashing and discusses the technical details of the methods, including unsupervised models based on text reconstruction and supervised models that integrate category information. It then analyzes key techniques, namely semantic enhancement based on neighbor information and latent topic information, and model optimization techniques. The datasets and evaluation metrics used for the text semantic hashing task are also summarized, and the performance of different text semantic hashing methods is compared on that basis. Finally, future research directions are discussed.
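To make the efficiency argument concrete, the following minimal sketch illustrates retrieval by Hamming distance over binary codes. It is an illustration only and not a method from the reviewed literature: the function names `hamming_distance` and `retrieve`, the 8-bit codes, and the document identifiers are all hypothetical, and the codes are written by hand, whereas in practice they would be produced by a trained hashing model.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    # XOR leaves a 1 wherever the codes disagree; count those bits.
    return bin(a ^ b).count("1")


def retrieve(query: int, database: dict, k: int = 2) -> list:
    """Return the ids of the k codes closest to the query in Hamming distance."""
    # Ties are broken by document id so the ranking is deterministic.
    ranked = sorted(
        database,
        key=lambda doc_id: (hamming_distance(query, database[doc_id]), doc_id),
    )
    return ranked[:k]


# Hypothetical 8-bit hash codes (assumed for illustration only).
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,
    "doc_c": 0b01001101,
}
query_code = 0b10110110

top = retrieve(query_code, codes, k=2)
print(top)
```

Each comparison reduces to one XOR and one bit count on a machine word, which is why retrieval over binary codes scales to large collections more readily than comparison of real-valued vectors.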