Research and optimization of page updated forecast on Nutch
HU Wei, WU Haitao
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University
Abstract:
Nutch predicts web page updates with an adjacent-comparison method whose update parameters must be set manually. Because these parameters cannot adapt automatically, the method cannot cope with the differences in update behavior across a massive number of web pages. To address this problem, this paper proposes a dynamic selection strategy that improves Nutch's web page update prediction. When the update history of a page is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler has to fetch: the update cycle of the sampled pages is used as the update cycle of the other pages in the same cluster. When enough update history is available, the strategy models that history as a Poisson process, which predicts the update cycle of each page more accurately. Finally, the improved strategy is tested on a Hadoop distributed platform. The experimental results show that the optimized web page update prediction method performs better than the original approach.
Key words:  Nutch  web page update prediction  density-based clustering (DBSCAN)  Poisson process  distributed programming (MapReduce)
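
The dynamic selection strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not give the estimator, the history threshold, or any function names, so everything below is an assumption made for illustration: `MIN_HISTORY`, `poisson_update_period`, and `select_update_period` are hypothetical names, and the rate estimate is a smoothed form of the maximum-likelihood estimator for a Poisson process observed at regular visits (0.5 is added to the numerator and denominator so that the logarithm stays finite when every visit detected a change). The cluster-level update cycle is taken as an input and is assumed to come from a separate DBSCAN step.

```python
import math
from typing import Sequence

# Hypothetical threshold: minimum number of recorded visits before the
# Poisson estimate is used. The abstract does not state a value.
MIN_HISTORY = 10


def poisson_update_period(changed: Sequence[bool], visit_interval: float) -> float:
    """Estimate the mean time between updates of one page.

    changed[i] is True if the page was found modified on visit i.
    Visits are assumed to be evenly spaced, visit_interval apart.
    Under a Poisson model with rate lam, the probability that a visit
    detects a change is 1 - exp(-lam * visit_interval).
    """
    n = len(changed)
    if n == 0 or visit_interval <= 0:
        raise ValueError("need at least one visit and a positive interval")
    x = sum(bool(c) for c in changed)
    # Smoothed fraction of visits with no change; never 0, so log is finite.
    unchanged_ratio = (n - x + 0.5) / (n + 0.5)
    rate = -math.log(unchanged_ratio) / visit_interval
    # No change ever observed: the estimated rate is zero, the period unbounded.
    return math.inf if rate == 0.0 else 1.0 / rate


def select_update_period(
    changed: Sequence[bool],
    visit_interval: float,
    cluster_period: float,
    min_history: int = MIN_HISTORY,
) -> float:
    """Choose between the cluster-level cycle and the per-page Poisson estimate."""
    if len(changed) < min_history:
        # Too little history: reuse the update cycle of the cluster's sample pages.
        return cluster_period
    return poisson_update_period(changed, visit_interval)
```

With daily visits, for example, `select_update_period(history, 1.0, cluster_period=7.0)` returns 7.0 for a page with fewer than ten recorded visits and a per-page estimate in days otherwise. Returned periods are in the same unit as `visit_interval`.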