摘要: |
随着网络的高速发展,其信息资源越来越庞大,面对巨量的信息库,搜索引擎起着重要的作用.主题爬虫技术作为搜索引擎的主要核心部分,计算搜索结果与搜索主题的关系,该关系被称为相关性.一般主题爬虫方法只计算网页内容与搜索主题的相关性,作者所提主题爬虫,通过链接内容和锚文本内容计算链接的重要性,然后利用贝叶斯分类器对链接进行分类,最后利用余弦相似函数计算网页的相关性,如果相关值大于阀值,则认为该网页与预定主题相关,否则不相关.实验结果证明:所提出主题爬虫方法可以获得很高的精确度. |
关键词: 主题爬虫 贝叶斯分类器 网页相关性 |
DOI: |
分类号: |
基金项目: |
|
Fcoused crawler bused on Bayesian classifier |
JIA Haijun, CHEN Haiguang
|
College of Information ,Mechanical and Electrical Engineering,Shanghai Normal University
|
Abstract: |
With the rapid development of the network,its information resources are increasingly large and faced a huge amount of information database,search engine plays an important role.Focused crawling technique,as the main core portion of search engine,is used to calculate the relationship between search results and search topics,which is called correlation.Normally,focused crawling method is used only to calculate the correlation between web content and search related topics.In this paper,focused crawling method is used to compute the importance of links through link content and anchor text,then Bayesian classifier is used to classify the links,and finally cosine similarity function is used to calculate the relevance of web pages.If the correlation value is greater than the threshold the page is considered to be associated with the predetermined topics,otherwise not relevant.Experimental results show that a high accuracy can be obtained by using the proposed crawling approach. |
Key words: focused crawler Bayesian classifier relevance of web pages |