Text Labeling Algorithm Based on Conceptual-Semantic Relatedness and LDA

    Abstract: To improve the efficiency of text labeling and classification, this paper proposes TML (Text Mark Label), an automatic text labeling algorithm based on conceptual-semantic relatedness and LDA (Latent Dirichlet Allocation), intended to replace manual labeling of text classification tags. Building on the computation of semantic relatedness between concepts, the algorithm uses LDA to extract a topic representation of each text and then labels the text automatically by computing the expectation that its topics belong to each category. To verify the effectiveness of TML, supervised text classification experiments were run with text classifiers on standard text classification datasets. To compare how the dataset and the classifier affect classification performance, three different classifiers (Rocchio, KNN, SVM) were each applied to three datasets (WebKB, Reuters-21578, and 20-NewsGroup). The experimental results show that the TML algorithm effectively improves the efficiency of both text classification and text labeling.
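The labeling step described in the abstract (scoring each category by the expectation that a text's topics belong to it) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the weighted sum below is an assumption, as are the function names `expected_category_scores` and `label_document`, the relatedness table `topic_category_rel`, and all numeric values. In practice the topic distribution would come from a trained LDA model and the relatedness values from the concept-level semantic relatedness computation.

```python
from typing import Dict, List


def expected_category_scores(
    doc_topics: List[float],
    topic_category_rel: List[Dict[str, float]],
) -> Dict[str, float]:
    """Assumed scoring rule: score(c) = sum_k P(topic k | doc) * rel(topic k, c)."""
    if len(doc_topics) != len(topic_category_rel):
        raise ValueError("one relatedness row is required per topic")
    scores: Dict[str, float] = {}
    for p_topic, rel_row in zip(doc_topics, topic_category_rel):
        for category, rel in rel_row.items():
            # Accumulate the topic's contribution, weighted by its probability.
            scores[category] = scores.get(category, 0.0) + p_topic * rel
    return scores


def label_document(
    doc_topics: List[float],
    topic_category_rel: List[Dict[str, float]],
) -> str:
    """Return the category with the highest expected score."""
    scores = expected_category_scores(doc_topics, topic_category_rel)
    return max(scores, key=scores.get)


# Toy inputs (hypothetical): a document-topic distribution over three topics,
# and the relatedness of each topic to two example categories.
doc_topics = [0.7, 0.2, 0.1]
topic_category_rel = [
    {"course": 0.9, "faculty": 0.1},
    {"course": 0.2, "faculty": 0.8},
    {"course": 0.5, "faculty": 0.5},
]

scores = expected_category_scores(doc_topics, topic_category_rel)
label = label_document(doc_topics, topic_category_rel)
print(scores, label)
```

Labels produced this way could then stand in for manual labels when training a supervised classifier such as Rocchio, KNN, or SVM, which is the setup the abstract's experiments describe.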

     
