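Abstract:
To improve the accuracy of text classification, this paper proposes a classification method based on the word2vec averaging algorithm and an improved term frequency-inverse document frequency (TF-IDF) algorithm. The method targets the script text of health programs, where the number of samples is imbalanced across categories and the number of words is imbalanced across samples. By introducing information entropy and a correction factor, it mitigates the adverse effect of data imbalance on classification accuracy and recall. Experimental results show that, compared with the word2vec averaging model, the proposed method improves accuracy and recall by 7.3% and 10.5%, respectively.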
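Keywords: term frequency-inverse document frequency (TF-IDF); word2vec; information entropy; text classification; machine learning; weighting
DOI: 10.3969/J.ISSN.1000-5137.2020.01.014
CLC number: TP181
Funding: Shanghai Scientific Research Program (17DZ2292100)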
|
Research on line text classification based on TF-IDF and word2vec |
DAN Yuhao¹, HUANG Jifeng¹, YANG Lin², GAO Hai³

1. College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China
2. Shanghai Development Center of Computer Software Technology, Shanghai 201112, China
3. Shanghai Gaochuang Computer Technology Co., Ltd., Shanghai 200030, China
|
Abstract:
To improve the classification accuracy of line text (spoken script lines), a classification method based on the word2vec average algorithm and an improved term frequency-inverse document frequency (TF-IDF) algorithm is proposed. The method takes into account two characteristics of line text from health TV programs: the number of samples is unbalanced across categories, and the number of words is unbalanced across samples. By introducing information entropy and correction factors, the adverse impact of data imbalance on classification accuracy and recall is alleviated. The experimental results show that, compared with the word2vec average model, the proposed method improves classification accuracy and recall by 7.3% and 10.5%, respectively.
Key words: term frequency-inverse document frequency (TF-IDF); word2vec; information entropy; text classification; machine learning; weighting
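
The general idea of weighting word vectors by an entropy-adjusted TF-IDF score, instead of taking a plain word2vec average, can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula used in the paper: the abstract does not give the improved TF-IDF expression or the correction factor, so the function names `class_entropy_factor` and `weighted_doc_vector`, the smoothed IDF, and the per-class normalization used here are all hypothetical. Word vectors are passed in as a plain dictionary standing in for a trained word2vec model.

```python
import math
from collections import Counter


def class_entropy_factor(term, docs, labels):
    """Hypothetical factor: 1 minus the normalized entropy of a term's
    distribution over classes. A term concentrated in one class scores
    near 1; a term spread evenly over all classes scores near 0."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        return 1.0
    # Use the per-class document rate so that large classes do not dominate
    # (an assumed stand-in for handling class-size imbalance).
    rates = []
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        rates.append(sum(term in d for d in class_docs) / len(class_docs))
    total = sum(rates)
    if total == 0:
        return 0.0
    probs = [r / total for r in rates if r > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(classes))


def weighted_doc_vector(doc, docs, labels, vectors):
    """Average the word vectors of `doc`, weighting each term by
    TF * smoothed IDF * entropy factor (all assumed forms)."""
    tf = Counter(t for t in doc if t in vectors)
    if not tf:
        return None  # no known words in this document
    dim = len(next(iter(vectors.values())))
    weights = {}
    for term, count in tf.items():
        df = sum(term in d for d in docs)
        idf = math.log((1 + len(docs)) / (1 + df)) + 1.0
        factor = class_entropy_factor(term, docs, labels)
        # Dividing by len(doc) normalizes for document length.
        weights[term] = (count / len(doc)) * idf * factor
    total = sum(weights.values())
    if total == 0:
        # Every weight vanished: fall back to the plain word2vec average.
        weights = {t: float(c) for t, c in tf.items()}
        total = sum(weights.values())
    return [
        sum(w * vectors[t][i] for t, w in weights.items()) / total
        for i in range(dim)
    ]


# Toy data (invented for illustration only).
docs = [["diet", "sugar", "the"], ["diet", "fat", "the"], ["sleep", "rest", "the"]]
labels = ["nutrition", "nutrition", "sleep"]
vectors = {"diet": [1.0, 0.0], "sugar": [1.0, 0.0], "the": [0.0, 1.0]}
v = weighted_doc_vector(docs[0], docs, labels, vectors)
```

In this toy setup the word "the" occurs in every class at the same rate, so its entropy factor is zero and it drops out of the weighted average, whereas a plain mean would let it pull the document vector toward an uninformative direction.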