Abstract:
Missing-value imputation methods based on k-nearest neighbors typically measure the similarity of samples by the distance between them and do not differentiate attribute weights when computing that distance, i.e., all attributes contribute equally. In a non-uniformly distributed, imbalanced dataset, however, the heterogeneity of samples is often reflected in attributes with uncommon values, and the similarity between samples is affected by the probability of the attributes' values; the similarity computed by the traditional distance formula is then not accurate enough. This article therefore proposes an adaptive k-nearest neighbor missing-value imputation method, named AkNNI, for non-uniformly distributed imbalanced datasets. First, the probability density of each attribute is introduced to adjust its importance dynamically, amplifying the contribution of sparse values and attenuating that of frequent values in the distance calculation, so as to better express sample heterogeneity and capture inter-sample similarity. Second, because complete samples become scarce at high missing rates, a new k-nearest-neighbor selection process is designed that considers sample similarity and completeness together. Experiments on six non-uniformly distributed datasets compare the imputation performance of AkNNI with five classical imputation methods, verify the classification performance of the imputed datasets in a k-nearest neighbor classifier, and explore the interrelationships of the three evaluation metrics in depth.
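The density-adaptive weighting described above can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: the histogram-based density estimate, the inverse-density weight, the restriction of the donor pool to complete rows, and all function names are assumptions made for the sketch.

```python
import numpy as np

def density_weight(col, value, bins=10):
    # Estimate the probability density of `value` within attribute column
    # `col` via a histogram; rarer values receive a larger weight
    # (inverse density), so they contribute more to the distance.
    hist, edges = np.histogram(col, bins=bins, density=True)
    idx = int(np.clip(np.searchsorted(edges, value, side="right") - 1,
                      0, bins - 1))
    return 1.0 / (hist[idx] + 1e-9)

def weighted_knn_impute(X, k=3, bins=10):
    # Density-weighted kNN imputation sketch: for each incomplete row,
    # compute a weighted Euclidean distance to every complete row, where
    # each observed attribute is weighted by the inverse density of the
    # incomplete row's value in that attribute.
    X = np.asarray(X, dtype=float).copy()
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]                      # donor pool: complete rows only
    for i in np.where(~complete)[0]:
        row = X[i]
        obs = ~np.isnan(row)
        w = np.array([density_weight(donors[:, a], row[a], bins)
                      for a in np.where(obs)[0]])
        d = np.sqrt(((donors[:, obs] - row[obs]) ** 2 * w).sum(axis=1))
        nearest = donors[np.argsort(d)[:k]]
        row[~obs] = nearest[:, ~obs].mean(axis=0)  # impute by donor mean
    return X
```

Note that this sketch requires fully complete donor rows; the paper's method additionally handles the scarcity of complete samples at high missing rates by weighing similarity against completeness when selecting neighbors.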
The experimental results demonstrate that the AkNNI method achieves higher imputation accuracy and classification accuracy: among the six missing-value imputation methods, AkNNI attains the lowest average RMSE, the highest average Pearson correlation coefficient, and the highest average classification accuracy on each dataset. Moreover, AkNNI maintains low RMSE, a high Pearson correlation coefficient, and high classification accuracy on each dataset even at high missing rates.