A Few Tips for Nutch (and Web Crawlers in General)
Published: 2019-06-28



Here are the things that can potentially slow down fetching:

1) DNS setup

2) The number of fetchers you run: too many or too few.

3) Bandwidth limitations

4) Number of threads per host (politeness)

5) An uneven distribution of urls to fetch, combined with politeness.

6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls).

7) Many slow websites (again usually with an uneven distribution).

8) Downloading lots of content (PDFs, very large html pages; again, possibly an uneven distribution).

9) Others

Now, how do we fix them?

1) Have a DNS cache on each local crawling machine. If multiple crawling machines share a single centralized DNS server, the crawl can act like a DoS attack on that server and slow the entire system. We always used a two-layer setup, hitting a local DNS cache first and then a large upstream cache such as OpenDNS or Verizon.

2) The number of fetchers is the number of map tasks * fetcher.threads.fetch, so 10 map tasks * 20 threads = 200 fetchers at once. Too many and you overload your system; too few and the other factors dominate while the machine sits idle. You will need to play around with this setting for your setup.
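
As a concrete sketch, the thread count is an ordinary property in conf/nutch-site.xml (the value 20 below is only an illustrative starting point, and the number of map tasks is set on the Hadoop side):

    <property>
      <name>fetcher.threads.fetch</name>
      <value>20</value>  <!-- fetchers at once = map tasks * this value -->
    </property>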

3) Bandwidth limitations. Use ntop, Ganglia, and other monitoring tools to determine how much bandwidth you are using, accounting for both inbound and outbound traffic. A simple test: from a server inside the fetching network that is not itself fetching, try connecting to or downloading content while a fetch is running; if it is very slow, it is a good bet you are maxing out bandwidth. If you set the http timeout as described later and are maxing your bandwidth, you will also start seeing many http timeout errors.

4) Politeness combined with an uneven distribution of urls is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of urls from that site to fetch, all other threads will sit idle while that one thread finishes. Some solutions: use fetcher.server.delay to shorten the time between page fetches, and use fetcher.threads.per.host to increase the number of threads fetching from a single site (these threads would still run in the same map task, and hence the same JVM child task process). If you increase fetcher.threads.per.host, you can also set fetcher.server.min.delay to some value > 0 to put minimum and maximum bounds on politeness.
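
As a sketch, all three knobs live in conf/nutch-site.xml; the values below are purely illustrative and should be tuned to your own politeness requirements:

    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>  <!-- seconds to wait between fetches from the same host -->
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>2</value>  <!-- allow two concurrent fetches per host -->
    </property>
    <property>
      <name>fetcher.server.min.delay</name>
      <value>0.5</value>  <!-- lower bound on the per-host delay when using multiple threads per host -->
    </property>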

5) Fetching a lot of pages from a single site, or a lot of pages from a few sites, will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting generate.max.per.host to a value > 0 will limit the number of pages fetched from a single host/domain.
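
A minimal example for conf/nutch-site.xml (the limit of 100 is arbitrary, chosen here just for illustration):

    <property>
      <name>generate.max.per.host</name>
      <value>100</value>  <!-- at most 100 urls per host in each generated fetch list -->
    </property>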

6) Nutch obeys the Crawl-Delay directive in robots.txt. Most sites don't use this setting, but a few do (some maliciously); I have seen crawl-delays as high as two days, expressed in seconds. The fetcher.max.crawl.delay property tells Nutch to skip pages whose crawl-delay is greater than its value. I usually set this to 10 seconds; the default is 30. Even at 10 seconds, if you have a lot of pages from a site that only lets you crawl one page every 10 seconds, it is going to be slow. On the flip side, setting this to a low value will skip those pages rather than fetch them.
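
Following the 10-second suggestion above, the corresponding conf/nutch-site.xml entry would look like this:

    <property>
      <name>fetcher.max.crawl.delay</name>
      <value>10</value>  <!-- skip pages whose robots.txt Crawl-Delay exceeds 10 seconds -->
    </property>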

7) Sometimes (often, in fact) websites are just slow. Setting a low value for http.timeout helps; the default is 10 seconds. If you don't care about slow sites and want as many pages as fast as possible, set it lower. Some websites, Digg for instance, will bandwidth-limit you on their side, only allowing x connections per given time frame. So even if you have only, say, 50 pages from a single site (which I still think is too many), you may be waiting 10 seconds on each page. The ftp.timeout property can also be set if you are fetching ftp content.
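
A sketch of both timeouts in conf/nutch-site.xml; note that Nutch expresses these in milliseconds (so the 10-second default is 10000), and the values below are illustrative:

    <property>
      <name>http.timeout</name>
      <value>5000</value>  <!-- give up on a request after 5 seconds -->
    </property>
    <property>
      <name>ftp.timeout</name>
      <value>30000</value>  <!-- only relevant if you fetch ftp:// urls -->
    </property>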

8) Lots of content means slower fetching, especially if you are downloading PDFs and other non-html documents. To avoid non-html content you can use the url filters; I prefer the prefix and suffix filters. The http.content.limit and ftp.content.limit properties can be used to limit the amount of content downloaded for a single document.
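
The content limits are again plain properties in conf/nutch-site.xml; 64 KB (65536) is the traditional default for http.content.limit, and the 128 KB values below are only examples:

    <property>
      <name>http.content.limit</name>
      <value>131072</value>  <!-- truncate any single http document at 128 KB -->
    </property>
    <property>
      <name>ftp.content.limit</name>
      <value>131072</value>  <!-- the same cap for ftp content -->
    </property>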

9) Other things that could be causing slow fetching:

  • Maxing out the number of open sockets/files on a machine. You will start seeing IO errors or can't-open-socket errors.
  • Poor routing. Bad routers or home routers may not be able to handle the number of simultaneous connections. An incorrect routing setup can also cause problems, but those are usually much harder to diagnose; use network trace and mapping tools if you think this is happening. Upstream routing from your network provider can also be a problem.
  • Bad network cards. I have seen network cards flip out once they reach a certain bandwidth point; at the time this was more prevalent on newer gigabit cards. Not usually my first thought, but always a possibility. Use tcpdump and network monitoring tools on the affected interface.

Reposted from: https://my.oschina.net/momohuang/blog/133484
