Abstract:
To improve the efficiency of text labeling and classification, an automatic text labeling algorithm called TML, based on conceptual-semantic relatedness and Latent Dirichlet Allocation (LDA), is proposed. The algorithm is intended to replace the manual assignment of category labels to texts used for text classification. It computes the semantic relatedness between concepts, uses LDA to extract a topic representation of each text, and then labels the text automatically by computing the expectation that the topics of the text belong to each category. To verify the effectiveness of the TML algorithm, supervised text categorization experiments were carried out on standard text categorization datasets: three classifiers (Rocchio, KNN, SVM) were evaluated on three datasets (WebKB, Reuters-21578, and 20-NewsGroup). The experimental results show that the TML algorithm can effectively improve the efficiency of text classification and text labeling.
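
The labeling step described above can be illustrated with a minimal sketch. It is not the TML formulation itself, which the abstract does not spell out. It assumes that the document-topic and topic-word distributions have already been produced by an LDA model, and that a word-to-category relatedness score in [0, 1] is available. The function name `label_text`, the scoring rule (the expected relatedness under the document's topic mixture), and all toy numbers are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def label_text(
    doc_topics: List[float],
    topic_words: List[Dict[str, float]],
    relatedness: Dict[Tuple[str, str], float],
    categories: List[str],
) -> Tuple[str, Dict[str, float]]:
    """Pick the category with the highest expected relatedness.

    Assumed scoring rule (illustrative, not taken from the paper):
        score(c) = sum_k P(k | doc) * sum_w P(w | k) * rel(w, c)
    """
    scores: Dict[str, float] = {}
    for category in categories:
        total = 0.0
        for p_topic, words in zip(doc_topics, topic_words):
            # Expected relatedness of this topic's words to the category;
            # word/category pairs missing from the table count as 0.
            topic_score = sum(
                p_word * relatedness.get((word, category), 0.0)
                for word, p_word in words.items()
            )
            total += p_topic * topic_score
        scores[category] = total
    best = max(categories, key=lambda c: scores[c])
    return best, scores


# Toy inputs standing in for LDA output and a relatedness measure.
doc_topics = [0.75, 0.25]  # P(topic | document)
topic_words = [            # P(word | topic), truncated to the top words
    {"course": 0.5, "homework": 0.5},
    {"professor": 0.5, "research": 0.5},
]
relatedness = {            # hypothetical word-to-category relatedness
    ("course", "course"): 1.0,
    ("homework", "course"): 0.8,
    ("professor", "course"): 0.3,
    ("professor", "faculty"): 0.9,
    ("research", "faculty"): 0.7,
}
categories = ["course", "faculty"]

label, scores = label_text(doc_topics, topic_words, relatedness, categories)
print(label, scores)
```

In this toy setting the document is dominated by the first topic, whose top words are closely related to the "course" category, so that category receives the higher expected score and becomes the automatic label.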