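Abstract:
This paper optimizes and improves on the shortcomings of feature dimensionality reduction methods and the traditional information gain method in Chinese web page text classification, with the aim of effectively improving classification efficiency and accuracy. First, part-of-speech filtering and synonym merging are applied to the feature items for an initial round of feature dimensionality reduction; next, an improved information gain method is proposed to compute feature weights for the feature items; finally, the support vector machine (SVM) classification algorithm is used to classify Chinese web pages. Both theoretical analysis and experimental results show that this method offers better performance and classification results than traditional methods.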
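Keywords: information gain method; part-of-speech filtering; synonym merging; feature weighting; support vector machine
DOI:
Classification number:
Funding: Scientific Research Innovation Project of the Shanghai Municipal Education Commission (09YZ154)

Research on a Chinese web page SVM classifier based on information gain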
PAN Zhengcai, CHEN Haiguang

College of Information, Mechanical and Electrical Engineering, Shanghai Normal University

Abstract:
To improve the efficiency and accuracy of text classification, this paper addresses the shortcomings of existing feature dimensionality reduction methods and of the traditional information gain method in the classification of Chinese web pages. First, part-of-speech filtering and synonym merging are applied to the feature items as an initial dimensionality reduction step. Then, an improved information gain method is proposed for computing the weights of the feature items. Finally, a Support Vector Machine (SVM) classifier is used to classify the Chinese web pages. Both theoretical analysis and experimental results show that the proposed method achieves better performance and classification results than the traditional method.
Key words: information gain method; part-of-speech filtering; synonym merging; feature weighting; Support Vector Machine
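
For orientation, the traditional information gain score that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the standard textbook definition (class entropy minus the entropy conditioned on whether a term is present), not the improved weighting proposed in the paper, whose details are not given in this abstract. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Traditional information gain of a term treated as a present/absent feature.

    docs   : list of sets of terms, one set per document (hypothetical toy format)
    labels : list of class labels, aligned with docs
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    total = len(labels)
    # Entropy of the class labels once we know whether the term occurs.
    conditional = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - conditional


# Toy corpus (made up for illustration only).
docs = [
    {"football", "match"},
    {"football", "team"},
    {"stock", "market"},
    {"stock", "team"},
]
labels = ["sports", "sports", "finance", "finance"]

ig_football = information_gain(docs, labels, "football")
ig_team = information_gain(docs, labels, "team")
```

In this toy corpus, "football" separates the two classes perfectly and so receives the maximum score of one bit, while "team" occurs equally in both classes and receives a score of zero. In a pipeline like the one described, such scores would be computed after part-of-speech filtering and synonym merging, and the resulting weighted features would then be passed to an SVM classifier.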