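Abstract: |
To address the high dimensionality of text in text classification, a document frequency (DF)-chi-square statistic feature extraction method is proposed that effectively prunes feature terms, reducing text dimensionality and improving classification accuracy. Building on the K Nearest Neighbor (KNN) algorithm, and to address the need to compute the similarity between each text to be classified and a large number of training samples, a KNN algorithm based on group center vectors is proposed: the samples within each category are divided into groups, the center vector of each group is computed, and these center vectors stand in for the training set when similarity is computed, which lowers computational complexity and improves classification performance. Experiments show that, compared with the traditional KNN algorithm, the improved algorithm achieves higher precision, recall, and F-measure, and that it has certain advantages over other classification algorithms. |
Keywords: text classification; K Nearest Neighbor (KNN) algorithm; feature extraction; similarity |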
DOI:10.3969/J.ISSN.1000-5137.2019.01.017 |
Classification number: |
Funding: |
|
Chinese text classification based on improved K Nearest Neighbor algorithm |
HUANG Chao, CHEN Junhua
|
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China
|
Abstract: |
This paper addresses the high-dimensionality problem encountered in text classification. A document frequency (DF)-chi-square statistic feature extraction method is proposed to prune feature items and reduce the dimensionality of the text. Building on the K Nearest Neighbor (KNN) algorithm, and in view of the need to compute the similarity between each text to be classified and a large number of training samples, a KNN algorithm based on group center vectors is proposed. The samples within each category are grouped and the center vector of each group is computed, so that the center vectors represent the training set in the similarity computation, reducing computational cost and improving classification performance. Experiments show that the improved algorithm achieves higher precision, recall, and F-measure than the traditional KNN algorithm, and that it has certain advantages over other classification algorithms. |
Key words: text classification; K Nearest Neighbor (KNN) algorithm; feature extraction; similarity |
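
The group-center idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how samples within a category are grouped, which similarity measure is used, or how votes are counted. The sketch therefore assumes fixed-size chunking of each category, cosine similarity, and a similarity-weighted vote. The names `cosine`, `group_centers`, and `classify`, and the `group_size` parameter, are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_centers(samples, labels, group_size):
    """Split each category into consecutive groups and average each group.

    Assumption: groups are formed by simple chunking in input order;
    the abstract does not describe the grouping rule.
    """
    by_class = defaultdict(list)
    for vec, label in zip(samples, labels):
        by_class[label].append(vec)

    centers = []  # list of (center_vector, label) pairs
    for label, vecs in by_class.items():
        for start in range(0, len(vecs), group_size):
            group = vecs[start:start + group_size]
            # Component-wise mean of the vectors in this group.
            center = [sum(col) / len(group) for col in zip(*group)]
            centers.append((center, label))
    return centers


def classify(query, centers, k):
    """Vote among the k center vectors most similar to the query.

    Assumption: each of the k neighbors votes with its similarity score.
    """
    scored = sorted(
        ((cosine(query, center), label) for center, label in centers),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += sim
    return max(votes, key=votes.get)


# Toy two-dimensional feature vectors (made up for illustration only).
samples = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.1],
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.1, 1.0],
]
labels = ["sports"] * 4 + ["finance"] * 4

centers = group_centers(samples, labels, group_size=2)
prediction = classify([0.9, 0.2], centers, k=3)
```

With `group_size=2`, the eight training vectors are replaced by four center vectors, so each query needs four similarity computations instead of eight. This is the kind of saving the abstract attributes to the method, at the cost of representing each group only by its mean.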