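Abstract: |
Nutch predicts web page updates with the adjacent comparison method, whose update parameters must be set manually and cannot adapt on their own, so it cannot cope with the differing update behavior of massive numbers of web pages. To solve this problem, a dynamic selection strategy is proposed to improve Nutch's web page update prediction method. When a page's update history is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler system has to fetch, and takes the update period of a sample page as the update period of the other pages in its cluster. When ample update history is available, the strategy models that history as a Poisson process to predict each page's update period more accurately. Finally, the improved strategy is tested on a Hadoop distributed platform. The experimental results show that the optimized web page update prediction method performs better. |
Keywords: Nutch; web page update prediction; density-based clustering algorithm; Poisson process; distributed programming |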
DOI: |
Classification number: |
Funding: |
|
Research on and optimization of web page update prediction in Nutch
HU Wei, WU Haitao
|
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University
|
Abstract: |
The web page update prediction method in Nutch is an adjacent comparison method whose update parameters must be set manually; it cannot adjust them adaptively and cannot cope with the differences among massive numbers of web page updates. To address this problem, this paper puts forward a dynamic selection strategy to improve Nutch's web page update prediction method. When the update history of a web page is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler system has to crawl, and the update cycle of a sample page is used as the update cycle of the other pages in the same cluster. When enough update history is available, the data are modeled as a Poisson process, which predicts each page's update cycle more accurately. Finally, the improved strategy is tested on the Hadoop distributed platform. The experimental results show that the optimized web page update prediction method performs better. |
Key words: Nutch; web page update prediction; DBSCAN; Poisson process; MapReduce |
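
The dynamic selection strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual estimator or switching rule, so everything here is an assumption: the function names, the `min_history` threshold, the fixed revisit interval, and the smoothed log estimator of the Poisson change rate (a common choice when a crawler only observes "changed" or "unchanged" at each visit). The cluster's sample period is taken as an input, standing in for the result of the DBSCAN step.

```python
import math


def estimate_update_period(n_visits, n_changed, visit_interval):
    """Estimate the mean time between updates under a Poisson change model.

    Assumes the page was polled n_visits times at a fixed visit_interval and
    that a change was detected on n_changed of those visits.
    """
    if n_visits <= 0 or not 0 <= n_changed <= n_visits:
        raise ValueError("need n_visits > 0 and 0 <= n_changed <= n_visits")
    unchanged = n_visits - n_changed
    # Under a Poisson process, P(no change in one interval) = exp(-rate * interval).
    # The +0.5 terms keep the logarithm finite when every visit saw a change.
    rate = -math.log((unchanged + 0.5) / (n_visits + 0.5)) / visit_interval
    return math.inf if rate == 0 else 1.0 / rate


def choose_update_period(history, cluster_period, min_history=10, visit_interval=1.0):
    """Pick an update period for one page (hypothetical selection rule).

    history: list of booleans, True where a visit detected a change.
    cluster_period: update period of the sample page of this page's cluster.
    min_history: assumed threshold separating "insufficient" from "enough" history.
    """
    if len(history) < min_history:
        # Too little history: reuse the period of the cluster's sample page.
        return cluster_period
    # Enough history: fit the Poisson model to this page's own observations.
    return estimate_update_period(len(history), sum(history), visit_interval)
```

For example, `choose_update_period([True, False, True], cluster_period=7.0)` falls back to the cluster's period of 7.0 because only three visits have been recorded, while a page with ten or more recorded visits gets its own Poisson-based estimate.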