Efficient Duplicate Detection for Data in Relational Databases

Abstract: As the volume of data in traditional relational databases grows, the probability that similar records appear increases markedly. This paper improves the Simhash algorithm by using the CityHash function to generate fingerprint values for the data, and uses those fingerprints to detect duplicates. Experiments on real petition-handling data from a district government in Guangzhou show higher recall and precision than the other algorithms compared. In addition, an index classification method is proposed to speed up one-pass similarity detection over the entire data set. With the fingerprint values stored in a MongoDB database, the method also supports efficient duplicate detection on incremental data. By improving the whole duplicate-detection process, the approach is both efficient and practical, and it provides a reference for handling cases scientifically and for identifying repeated cases.
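To make the pipeline summarized in the abstract concrete, the following is a minimal Python sketch of Simhash fingerprinting and near-duplicate lookup. It is an illustration under stated assumptions, not the paper's implementation. CityHash is not in the Python standard library, so a 64-bit truncated BLAKE2b digest stands in for it. The abstract does not describe the index classification method, so the sketch uses a common block-partition scheme instead. The names `token_hash`, `blocks`, and `FingerprintIndex`, the 64-bit fingerprint width, the four blocks, and the distance threshold of 3 are all hypothetical choices, and an in-memory dictionary stands in for the MongoDB fingerprint store.

```python
import hashlib
from collections import defaultdict

BITS = 64  # fingerprint width (assumed; not stated in the abstract)


def token_hash(token: str) -> int:
    """64-bit token hash. Stand-in for CityHash64 (assumption: stdlib only)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens) -> int:
    """Unweighted Simhash: each token votes +1/-1 on every bit position."""
    votes = [0] * BITS
    for tok in tokens:
        h = token_hash(tok)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, v in enumerate(votes):
        if v > 0:
            fp |= 1 << i
    return fp


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def blocks(fp: int, n_blocks: int = 4):
    """Split a fingerprint into (block position, block value) keys."""
    width = BITS // n_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(n_blocks)]


class FingerprintIndex:
    """Hypothetical in-memory index; a dict stands in for the MongoDB store.

    With 4 blocks and max_distance=3, two fingerprints within the threshold
    must agree exactly on at least one block (pigeonhole principle), so only
    records sharing a block key need a full Hamming comparison.
    """

    def __init__(self, max_distance: int = 3):
        self.max_distance = max_distance
        self.buckets = defaultdict(set)
        self.fingerprints = {}

    def add(self, record_id, fp: int) -> None:
        self.fingerprints[record_id] = fp
        for key in blocks(fp):
            self.buckets[key].add(record_id)

    def near_duplicates(self, fp: int):
        candidates = set()
        for key in blocks(fp):
            candidates |= self.buckets.get(key, set())
        return sorted(
            r for r in candidates
            if hamming(fp, self.fingerprints[r]) <= self.max_distance
        )
```

In use, each stored record would be tokenized, fingerprinted once with `simhash`, and registered with `add`. An incoming record is then checked with `near_duplicates` against only the records that share a block key, rather than against every stored fingerprint, which is the kind of saving an index over fingerprints is meant to provide for incremental data.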