Nutch Study Notes, Part 1: Data Structures

1. WebDB or web database

The WebDB persists for as long as the web graph being crawled (and re-crawled) exists.
It stores two types of entities: pages and links.

A page represents a page on the Web, and is indexed by its URL and the MD5 hash of
its contents. Other pertinent information is stored, too, including the number of
links in the page (also called outlinks); fetch information (such as when the page
is due to be refetched); and the page's score, which is a measure of how important
the page is (for example, one measure of importance awards high scores to pages that
are linked to from many other pages).

A link represents a link from one web page (the source) to another (the target).

In the WebDB web graph, the nodes are pages and the edges are links.
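
To make the page and link entities concrete, here is a minimal sketch in Python. It is not Nutch's actual code or storage format: the names Page, Link, make_page and score_by_inlinks are invented for illustration, and the scoring is only the simple "count the pages that link to you" measure mentioned above.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    """One node of the web graph (hypothetical layout, not Nutch's)."""
    url: str
    content_md5: str      # MD5 hash of the page contents
    num_outlinks: int     # number of links found in the page
    next_fetch: datetime  # when the page is due to be refetched
    score: float = 0.0    # importance of the page


@dataclass(frozen=True)
class Link:
    """One edge of the web graph."""
    source: str  # URL of the page that contains the link
    target: str  # URL of the page being linked to


def make_page(url, content, outlinks, next_fetch):
    """Build a Page record from raw content bytes and its outlink URLs."""
    digest = hashlib.md5(content).hexdigest()
    return Page(url, digest, len(outlinks), next_fetch)


def score_by_inlinks(pages, links):
    """Toy scoring: a page's score is the number of distinct pages linking to it."""
    sources = {}
    for link in links:
        sources.setdefault(link.target, set()).add(link.source)
    for url, page in pages.items():
        page.score = float(len(sources.get(url, ())))
```

Here pages is assumed to be a dict keyed by URL, so each URL maps to exactly one node of the graph.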

2. Segment

A segment is a collection of pages fetched and indexed by the crawler in a single run.

The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated
from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist.
The fetcher output for the segment is indexed and the index is stored in the segment.
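
The idea of generating a fetchlist from the WebDB can be sketched as follows. This is a simplified illustration under my own assumptions: the function generate_fetchlist and the dict mapping each URL to its next fetch time are hypothetical, and the real generator may apply other criteria when choosing URLs.

```python
from datetime import datetime


def generate_fetchlist(next_fetch_by_url, now, limit=None):
    """Return the URLs whose refetch time has passed, most overdue first."""
    due = [(when, url) for url, when in next_fetch_by_url.items() if when <= now]
    due.sort()  # earliest due time first
    urls = [url for _, url in due]
    return urls if limit is None else urls[:limit]
```

The optional limit caps the size of the list, so a single crawler run (and therefore a single segment) stays bounded.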

Any given segment has a limited lifespan, since it is obsolete as soon as all of its
pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually
a good idea to delete segments older than this, particularly as they take up so much
disk space.

Segments are named by the date and time they were created, so it's easy to tell how old
they are.
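
Because the age can be read off the name, finding segments that are safe to delete is easy to script. The sketch below assumes the names follow a yyyyMMddHHmmss pattern; that pattern is my assumption, so check your own segments directory before relying on it. The helper stale_segments is hypothetical and only reports names; it deletes nothing.

```python
from datetime import datetime, timedelta

SEGMENT_NAME_FORMAT = "%Y%m%d%H%M%S"   # assumed naming pattern
REFETCH_INTERVAL = timedelta(days=30)  # default re-fetch interval from the text


def stale_segments(names, now, max_age=REFETCH_INTERVAL):
    """Return the segment names that are older than max_age, sorted."""
    stale = []
    for name in names:
        try:
            created = datetime.strptime(name, SEGMENT_NAME_FORMAT)
        except ValueError:
            continue  # skip names that do not match the assumed pattern
        if now - created > max_age:
            stale.append(name)
    return sorted(stale)
```

In practice the names would come from listing the segments directory, and now would be datetime.now().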

3. Index

The index is the inverted index of all of the pages the system has retrieved, and is
created by merging all of the individual segment indexes. Nutch uses Lucene for its
indexing, so all of the Lucene tools and APIs are available to interact with the
generated index.
Since this has the potential to cause confusion, it is worth mentioning that the Lucene
index format has a concept of segments, too, and these are different from Nutch segments.

A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched
and indexed portion of the WebDB.
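
To illustrate what "an inverted index created by merging the individual segment indexes" means, here is a toy version in Python. It has nothing to do with Lucene's real API or file format: build_segment_index and merge_indexes are hypothetical helpers, and the tokenizer is just a lowercase whitespace split.

```python
from collections import defaultdict


def build_segment_index(docs):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def merge_indexes(segment_indexes):
    """Union the posting sets of several per-segment indexes into one index."""
    merged = defaultdict(set)
    for index in segment_indexes:
        for term, urls in index.items():
            merged[term] |= urls
    return merged
```

Looking up a term in the merged index returns every page, from any segment, that contains it.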

Posted on 2007-10-03 21:30 by 专心练剑. Category: Search Engines
