
Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB,
a set of fetchers downloads the content from the Web, the crawler updates the WebDB with
new links that were found, and then the crawler generates a new set of fetchlists (for
links that haven't been fetched for a given period, including the new links found in the
previous cycle) and the cycle repeats.

This cycle is often referred to as the generate/fetch/update cycle, and runs periodically
as long as you want to keep your search index up to date.
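
In practice this means re-running the crawl on a schedule, for example from cron. Below is a minimal Java sketch of such a driver for the one-shot crawl tool described next; the urls file, the timestamped output directory and the depth of 3 are assumptions, and the exact crawl arguments should be checked against your Nutch version.

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    // Sketch: launch the one-shot crawl tool into a fresh, timestamped directory.
    // The "urls" file, directory naming and "-depth 3" are assumptions, not Nutch defaults.
    public class PeriodicCrawl {
        public static void main(String[] args) throws IOException, InterruptedException {
            String dir = "crawl-" + new SimpleDateFormat("yyyyMMdd-HHmm").format(new Date());
            Process p = new ProcessBuilder("bin/nutch", "crawl", "urls", "-dir", dir, "-depth", "3")
                    .inheritIO()          // show the crawler's console output
                    .start();
            System.exit(p.waitFor());     // exit with the crawl tool's status code
        }
    }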

URLs with the same host are always assigned to the same fetchlist. This is done for reasons
of politeness, so that a web site is not overloaded with requests from multiple fetchers in
rapid succession.
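
To picture the effect of that rule: before fetchlists are handed out, URLs are grouped by host, so each site ends up in the hands of a single fetcher. The snippet below is purely illustrative and is not Nutch's partitioning code; it just shows the kind of bucketing involved.

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustration only: bucket URLs by host, the way fetchlist assignment keeps
    // all URLs for one site together (this is not Nutch's actual partitioning code).
    public class GroupByHost {
        public static void main(String[] args) {
            String[] urls = {
                "http://example.com/a.html",
                "http://example.com/b.html",
                "http://example.org/index.html"
            };
            Map<String, List<String>> byHost = new HashMap<String, List<String>>();
            for (String url : urls) {
                String host = URI.create(url).getHost();
                if (!byHost.containsKey(host)) {
                    byHost.put(host, new ArrayList<String>());
                }
                byHost.get(host).add(url);   // same host -> same bucket -> same fetchlist
            }
            System.out.println(byHost);
        }
    }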

Nutch observes the Robots Exclusion Protocol, which allows site owners to control which
parts of their site may be crawled.
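
A site advertises these rules in a robots.txt file at the root of the host, using directives such as User-agent and Disallow. The snippet below simply prints a site's robots.txt; example.com is a placeholder, and this is only an illustration of where the rules live, not of how Nutch reads them.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Illustration: the rules a polite crawler consults live at /robots.txt on each host.
    public class ShowRobots {
        public static void main(String[] args) throws Exception {
            URL robots = new URL("http://example.com/robots.txt");
            BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()));
            String line;
            while ((line = in.readLine()) != null) {
                // Typical directives: "User-agent: *", "Disallow: /private/"
                System.out.println(line);
            }
            in.close();
        }
    }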

The crawl tool is actually a front end to other, lower-level tools, so it is possible to
get the same results by running the lower-level tools in a particular sequence. Here is
a breakdown of what crawl does, with the lower-level tool names in parentheses:

   1. Create a new WebDB (admin db -create).
   2. Inject root URLs into the WebDB (inject) to seed the initial link set.
   3. Generate a fetchlist from the WebDB in a new segment (generate).
   4. Fetch content from URLs in the fetchlist (fetch).
   5. Update the WebDB with links from fetched pages (updatedb).
   6. Repeat steps 3-5 until the required depth is reached.
   7. Update segments with scores and links from the WebDB (updatesegs).
   8. Index the fetched pages (index).
   9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
  10. Merge the indexes into a single index for searching (merge).
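
To make the sequence concrete, here is a minimal Java driver that shells out to the lower-level tools in the order listed above. The tool names come from the list; everything else -- the db/ and segments/ directory names, the urls file, the depth, and the exact arguments each tool takes -- is an assumption, so check each tool's usage message (bin/nutch <tool>) for your Nutch version before relying on it.

    import java.io.File;
    import java.io.IOException;
    import java.util.Arrays;

    // Sketch of the crawl tool's work, step by step, using the lower-level tools.
    // Directory layout and tool arguments are assumptions for illustration only.
    public class CrawlByHand {

        // Run "bin/nutch <tool> <args...>" and fail fast if it returns non-zero.
        static void nutch(String... args) throws IOException, InterruptedException {
            String[] cmd = new String[args.length + 1];
            cmd[0] = "bin/nutch";
            System.arraycopy(args, 0, cmd, 1, args.length);
            System.out.println("running " + Arrays.toString(cmd));
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("tool failed: " + Arrays.toString(cmd));
            }
        }

        // Newest subdirectory of segments/ -- segment directories are named by timestamp.
        static String latestSegment() {
            File[] segs = new File("segments").listFiles(File::isDirectory);
            Arrays.sort(segs);
            return segs[segs.length - 1].getPath();
        }

        public static void main(String[] args) throws Exception {
            int depth = 3;                                  // rounds of generate/fetch/update

            nutch("admin", "db", "-create");                // 1. create a new WebDB
            nutch("inject", "db", "-urlfile", "urls");      // 2. inject root URLs (flag assumed)

            for (int round = 0; round < depth; round++) {   // 3-6. the generate/fetch/update cycle
                nutch("generate", "db", "segments");        // 3. fetchlist in a new segment
                String segment = latestSegment();
                nutch("fetch", segment);                    // 4. fetch the listed URLs
                nutch("updatedb", "db", segment);           // 5. fold new links into the WebDB
            }

            nutch("updatesegs", "db", "segments");          // 7. push scores/links into segments (args assumed)
            for (File seg : new File("segments").listFiles(File::isDirectory)) {
                nutch("index", seg.getPath());              // 8. index each fetched segment
            }
            nutch("dedup", "segments", "dedup.tmp");        // 9. drop duplicate content/URLs (args assumed)
            nutch("merge", "index", "segments");            // 10. merge into one index (args assumed)
        }
    }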


After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is
bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle
has finished, the crawler goes on to create an index from all of the segments (steps 7-10).

Each segment is indexed independently (step 8), before duplicate pages (that is, pages
at different URLs with the same content) are removed (step 9). Finally, the individual
indexes are combined into a single index (step 10).
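
Nutch's indexes are Lucene indexes, so the final merge is conceptually the same as combining several standalone Lucene indexes into one. The fragment below only illustrates that idea using the Lucene API of that era; it is not Nutch's merge code, the paths are placeholders, and the exact API depends on the Lucene version.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Illustration only: merge two existing Lucene indexes into a new one.
    // Nutch's own merge tool handles this for its segment indexes.
    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("merged-index", new StandardAnalyzer(), true);
            writer.addIndexes(new Directory[] {
                FSDirectory.getDirectory("segment-index-1", false),
                FSDirectory.getDirectory("segment-index-2", false)
            });
            writer.optimize();  // compact the merged index for searching
            writer.close();
        }
    }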

The dedup tool can remove duplicate URLs from the segment indexes. This is not about removing multiple fetches of the same URL because the URL was duplicated in the WebDB--that cannot happen, since the WebDB does not allow duplicate URL entries. Instead, duplicates arise when a URL is re-fetched and the old segment from the previous fetch still exists (because it hasn't been deleted). This situation can't arise during a single run of the crawl tool, but it can during re-crawls, which is why dedup also removes duplicate URLs.

While the crawl tool is a great way to get started with crawling websites, you will need
to use the lower-level tools to perform re-crawls and other maintenance on the data
structures built during the initial crawl. We shall see how to do this in the real-world
example later, in part two of this series. Also, crawl is really aimed at intranet-scale
crawling. To do a whole web crawl, you should start with the lower-level tools. (See the
"Resources" section for more information.)

