Abstract:
As the volume of data in traditional relational databases grows, the likelihood of similar or duplicate records increases considerably. This paper improves the Simhash algorithm for duplicate detection by using the CityHash function to compute the fingerprint feature values. The improved algorithm was tested on real data from the petition business of a district government in Guangzhou, and the results show higher recall and precision than the other algorithms compared. In addition, an index classification method is presented to speed up similarity detection over the full data set. With the fingerprint values stored in a MongoDB database, the method also supports efficient processing of incremental data. Together, these changes improve the overall similarity detection process and offer a reference for researchers.
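
To make the fingerprinting idea concrete, the following is a minimal sketch of a generic 64-bit Simhash with Hamming-distance comparison. It is not the authors' implementation: the function names (`token_hash`, `simhash`, `hamming`), whitespace tokenization, and unit token weights are assumptions made for illustration, and BLAKE2b from the Python standard library stands in for CityHash, which is not available without a third-party package.

```python
import hashlib


def token_hash(token: str) -> int:
    """Return a 64-bit hash of a token.

    Assumption: BLAKE2b (8-byte digest) is a stand-in for CityHash64.
    """
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, bits: int = 64) -> int:
    """Compute a Simhash fingerprint using unit weights per token."""
    weights = [0] * bits
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two records would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold itself is a tuning parameter that this sketch does not fix.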