Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB,
a set of fetchers downloads the content from the Web, the crawler updates the WebDB with
new links that were found, and then the crawler generates a new set of fetchlists (for
links that haven't been fetched for a given period, including the new links found in the
previous cycle) and the cycle repeats.
This cycle is often referred to as the generate/fetch/update cycle, and runs periodically
as long as you want to keep your search index up to date.
URLs with the same host are always assigned to the same fetchlist. This is done for reasons
of politeness, so that a web site is not overloaded with requests from multiple fetchers in
rapid succession.
Nutch observes the Robots Exclusion Protocol, which allows site owners to control which
parts of their site may be crawled.
The crawl tool is actually a front end to other, lower-level tools, so it is possible to
get the same results by running the lower-level tools in a particular sequence. Here is
a breakdown of what crawl does, with the lower-level tool names in parentheses:
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB to create the initial set of pages to fetch (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is
bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle
has finished, the crawler goes on to create an index from all of the segments (steps 7-10).
Each segment is indexed independently (step 8), before duplicate pages (that is, pages
at different URLs with the same content) are removed (step 9). Finally, the individual
indexes are combined into a single index (step 10).
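To make the correspondence concrete, here is a rough sketch of the same sequence run by hand with the lower-level tools. The directory names (db, segments, index) and the urls.txt seed file are just placeholders, and the arguments shown follow 0.7-era usage that may differ in your version of Nutch, so check each tool's usage message (most tools print one when run without arguments) before relying on it:

  bin/nutch admin db -create              # step 1: create a new WebDB in the db directory
  bin/nutch inject db -urlfile urls.txt   # step 2: inject the root URLs listed in urls.txt

  # steps 3-5 (one pass; repeat these three commands once per level of depth for step 6)
  bin/nutch generate db segments          # write a fetchlist into a new segment
  s=`ls -d segments/2* | tail -1`         # segments are named by timestamp; pick the newest
  bin/nutch fetch $s                      # fetch the pages named in the fetchlist
  bin/nutch updatedb db $s                # add newly discovered links to the WebDB

  bin/nutch updatesegs db segments        # step 7: update segments with scores and links
  bin/nutch index $s                      # step 8: index a fetched segment (run once per segment)
  bin/nutch dedup segments dedup.tmp      # step 9: remove duplicate content and duplicate URLs
  bin/nutch merge index segments/*        # step 10: merge the segment indexes into one index

The crawl tool simply chains these steps together with sensible defaults, which is why a single crawl command can do the whole job.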
The dedup tool also removes duplicate URLs from the segment indexes. This is not because URLs can be duplicated in the WebDB: that cannot happen, since the WebDB does not allow duplicate URL entries. Instead, duplicate URLs arise when a URL is re-fetched and the segment from the previous fetch still exists (because it hasn't been deleted). This situation can't occur during a single run of the crawl tool, but it can during re-crawls, which is why dedup removes duplicate URLs as well as duplicate content.
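As a minimal sketch of how that happens, suppose the WebDB built by the first crawl is used to drive a later re-fetch (again with placeholder directory names and version-dependent arguments):

  # re-crawl: generate a fetchlist of pages that are due to be re-fetched
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s
  bin/nutch index $s

  # the index of the older segment still contains the previous fetch of these
  # URLs, so dedup must run across all segment indexes before they are merged
  bin/nutch dedup segments dedup.tmp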
While the crawl tool is a great way to get started with crawling websites, you will need
to use the lower-level tools to perform re-crawls and other maintenance on the data
structures built during the initial crawl. We shall see how to do this in the real-world
example later, in part two of this series. Also, crawl is really aimed at intranet-scale
crawling. To do a whole web crawl, you should start with the lower-level tools. (See the
"Resources" section for more information.)