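Chinese text classification based on an improved K nearest neighbor algorithm
HUANG Chao, CHEN Junhua
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China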
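Abstract:
To address the high dimensionality of text in text classification, a document frequency (DF)-chi-square statistic feature extraction method is proposed. It effectively reduces the number of feature terms, lowers the text dimension, and improves classification accuracy. Building on the K nearest neighbor (KNN) algorithm, and to address the problem that a text to be classified must be compared for similarity against a large number of training samples, a KNN algorithm based on grouped center vectors is proposed. The samples within each category are divided into groups and the center vector of each group is computed, so that these center vectors stand in for the training set when similarity is computed. This reduces computational complexity and improves the classification performance of the algorithm. Experiments show that, compared with the traditional KNN algorithm, the improved algorithm achieves higher precision, recall, and F-measure, and that it has certain advantages over other classification algorithms.
Keywords:  text classification  K nearest neighbor (KNN) algorithm  feature extraction  similarity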
DOI:10.3969/J.ISSN.1000-5137.2019.01.017
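The abstract names a DF-chi-square feature reduction step but does not say how the two criteria are combined. The sketch below is one plausible reading, not the authors' procedure: it assumes terms are first filtered by a minimum document frequency and then ranked by their largest chi-square score over all classes. The function names and the `min_df` and `top_k` parameters are hypothetical, and documents are assumed to be already segmented into tokens.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    d: docs outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, min_df=2, top_k=1000):
    """Assumed combination: DF filter first, then chi-square ranking."""
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()                      # term -> number of docs containing it
    df_by_class = defaultdict(Counter)  # label -> term -> doc count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):        # count each term once per document
            df[term] += 1
            df_by_class[label][term] += 1

    scores = {}
    for term, freq in df.items():
        if freq < min_df:               # DF step: drop rare terms
            continue
        best = 0.0
        for label, size in class_sizes.items():
            a = df_by_class[label][term]
            b = freq - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best             # keep the strongest class association

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


docs = [
    ["ball", "game", "the"],
    ["game", "team", "the"],
    ["stock", "market", "the"],
    ["market", "bank", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
selected = select_features(docs, labels, min_df=2, top_k=2)
```

In this toy corpus the term "the" passes the DF filter but appears in every document, so its chi-square score is zero and it ranks below the class-specific terms.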
Chinese text classification based on improved K Nearest Neighbor algorithm
HUANG Chao, CHEN Junhua
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China
Abstract:
This paper addresses the high-dimensionality problem encountered in text classification. A document frequency (DF)-chi-square statistic feature extraction method is proposed to reduce the number of feature items and lower the dimension of the text. Building on the K Nearest Neighbor (KNN) algorithm, and to address the problem that a text to be classified must be compared for similarity against a large number of training samples, a KNN algorithm based on grouped center vectors is proposed. The samples within each category are divided into groups and the center vector of each group is computed, so that these center vectors stand in for the training set when similarity is computed, which reduces computational complexity and improves the classification performance of the algorithm. Experiments show that the improved algorithm achieves higher precision, recall, and F-measure than the traditional KNN algorithm, and that it has certain advantages over other classification algorithms.
Key words:  text classification  K Nearest Neighbor (KNN) algorithm  feature extraction  similarity
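The grouped-center-vector idea in the abstract can be illustrated with a minimal sketch. The abstract does not state how samples are grouped within a category, how similarity is measured, or how votes are counted, so the following are assumptions rather than the authors' method: each category's samples are split into consecutive chunks of a fixed `group_size`, similarity is cosine similarity, and the k nearest centers cast similarity-weighted votes. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_group_centers(vectors, labels, group_size=2):
    """Replace the training set with one mean vector per group.

    Assumed grouping rule: consecutive chunks of `group_size` samples
    within each category.
    """
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)

    centers = []
    for label, members in by_class.items():
        for start in range(0, len(members), group_size):
            group = members[start:start + group_size]
            center = [sum(col) / len(group) for col in zip(*group)]
            centers.append((center, label))
    return centers


def classify(query, centers, k=3):
    """KNN over the group centers with similarity-weighted voting."""
    scored = [(cosine(query, center), label) for center, label in centers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += sim
    return max(votes, key=votes.get)


vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.1],
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.1, 1.0],
]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
centers = build_group_centers(vectors, labels, group_size=2)
predicted = classify([1.0, 0.05], centers, k=3)
```

Here eight training vectors are reduced to four centers, so each query is compared against half as many vectors. The saving grows with `group_size`, at the cost of a coarser representation of each category.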