Efficient Duplicate Detection for Data in Relational Databases

Li Hengxin (李恒新); Han Jianhua (韩坚华)*

doi:10.6054/j.jscnun.2014.11.004

(School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China)

Funding:

National Natural Science Foundation of China (61142012)

Details

CLC number: TP391
Publication history
- Received: 2014-09-18
- Published: 2015-03-19

Abstract

Abstract: As the volume of data held in traditional relational databases grows, so does the likelihood that similar records appear. This paper improves the Simhash algorithm by using the CityHash function to generate fingerprint feature values, which are then used to detect duplicate data. Experiments on real petition-handling data from a district government in Guangzhou show higher recall and precision than other algorithms. An index classification method is also proposed to speed up one-pass similarity detection over the full data set, and storing the fingerprint values in MongoDB supports efficient duplicate detection on incremental data. Together, these improvements make the whole duplicate-detection process efficient and practical, and provide a reference for handling cases scientifically and for identifying repeated cases.
Keywords:
- Simhash /
- CityHash /
- MongoDB /
- fingerprint feature value /
- similarity detection
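
The abstract names Simhash fingerprints as the basis for duplicate detection. As a rough illustration only, the standard Simhash scheme can be sketched as below. This is not the authors' implementation: CityHash is not in the Python standard library, so `hashlib.blake2b` truncated to 64 bits stands in for it, all tokens are given equal weight, and the function names and the Hamming-distance threshold of 3 are assumptions made for this example.

```python
import hashlib
from typing import Iterable

FINGERPRINT_BITS = 64  # assumed fingerprint width


def token_hash(token: str) -> int:
    """Return a 64-bit hash of one token (BLAKE2b stand-in for CityHash64)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens: Iterable[str]) -> int:
    """Combine per-token hashes into one fingerprint by per-bit majority vote."""
    weights = [0] * FINGERPRINT_BITS
    for token in tokens:
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Iterable[str], b: Iterable[str], max_distance: int = 3) -> bool:
    """Treat two token sequences as duplicates if their fingerprints are close.

    max_distance=3 is an assumed threshold, not a value taken from the paper.
    """
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

In this sketch a record would first be split into tokens (for example, words from its text fields), and two records would be flagged for review when `is_near_duplicate` returns `True`. Records with mostly shared tokens tend to produce fingerprints that differ in few bits, which is what makes a small Hamming threshold usable.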
