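Abstract:
This paper proposes a news classification algorithm based on a second-order hidden Markov model (HMM). The algorithm extracts category-indicative words from the news content to build a feature word set, and this set serves as the observation sequence for the different second-order HMM classifiers. The hidden states of a second-order HMM reflect how the correlation between words varies within a document, with each state representing a correlation level of the words that appear in the corpus. Experimental results show that the second-order HMM algorithm achieves better classification performance than the k-nearest neighbor (kNN), naive Bayes, and support vector machine (SVM) algorithms.
Keywords: news classification; second-order hidden Markov model (HMM); term frequency-inverse document frequency (TF-IDF); χ² test; feature word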
DOI: 10.3969/J.ISSN.1000-5137.2018.04.016
CLC number: TP391
Funding:

News classification algorithm based on second-order Hidden Markov Model
Sun Xuan, Li Luqun, Jiang Longquan
The College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China
Abstract:
A novel algorithm based on a second-order Hidden Markov Model (HMM) is proposed for classifying news documents. It extracts categorical feature words from the news content to form a feature set, which is used as the observation sequence of the different second-order HMM classifiers. The hidden states of each second-order HMM reflect differences in the correlation between words in the documents, and each state represents a level of correlation of the words occurring in the corpus. Experiments show that the proposed second-order HMM classification algorithm outperforms the k-Nearest Neighbor (kNN), Naive Bayes, and Support Vector Machine (SVM) algorithms.
Key words: news classification; second-order Hidden Markov Model (HMM); term frequency-inverse document frequency (TF-IDF); χ² test; feature word
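
The abstract does not spell out how a feature-word sequence is scored by the second-order HMM classifiers, so the following is only a minimal sketch under stated assumptions: one second-order HMM per news category, discrete observation symbols (integer ids standing in for feature words), and assignment to the category whose model gives the highest sequence likelihood. The names `sequence_likelihood`, `classify`, `TRANSITIONS`, and `MODELS`, as well as every probability value, are hypothetical and not taken from the paper.

```python
def sequence_likelihood(obs, pi, a1, a2, b):
    """Forward algorithm for a second-order HMM (illustrative sketch).

    pi[i]       = P(s_1 = i)
    a1[i][j]    = P(s_2 = j | s_1 = i)
    a2[i][j][k] = P(s_t = k | s_{t-2} = i, s_{t-1} = j)
    b[k][o]     = P(o_t = o | s_t = k)
    """
    n = len(pi)
    if not obs:
        return 1.0
    if len(obs) == 1:
        return sum(pi[i] * b[i][obs[0]] for i in range(n))
    # alpha[i][j]: probability of the observations so far, with the
    # last two hidden states being (i, j).
    alpha = [[pi[i] * b[i][obs[0]] * a1[i][j] * b[j][obs[1]]
              for j in range(n)] for i in range(n)]
    for o in obs[2:]:
        # Advance the state pair (i, j) to (j, k) and emit symbol o from k.
        alpha = [[sum(alpha[i][j] * a2[i][j][k] for i in range(n)) * b[k][o]
                  for k in range(n)] for j in range(n)]
    return sum(map(sum, alpha))


def classify(obs, models):
    """Return the label whose model assigns obs the highest likelihood."""
    return max(models, key=lambda label: sequence_likelihood(obs, **models[label]))


# Made-up toy parameters: 2 hidden states, 2 observation symbols.
TRANSITIONS = dict(
    pi=[0.6, 0.4],
    a1=[[0.7, 0.3], [0.4, 0.6]],
    a2=[[[0.8, 0.2], [0.5, 0.5]],
        [[0.3, 0.7], [0.1, 0.9]]],
)
MODELS = {
    "sports": dict(TRANSITIONS, b=[[0.9, 0.1], [0.8, 0.2]]),
    "finance": dict(TRANSITIONS, b=[[0.2, 0.8], [0.1, 0.9]]),
}

print(classify([0, 0, 1, 0], MODELS))
```

For long documents the raw product of probabilities underflows, so a practical version would work in log space or rescale `alpha` at each step. Training the per-category parameters is outside the scope of this sketch.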