1. A complete crawl procedure can be described by the following pseudo-code:
inject: pass the links in the urls file to the webDB;
for (i = 0; i < depth; i++) {
    generate: create a new segment and generate a fetchlist from the webDB;
    fetch: fetch content from the URLs in the new fetchlist;
    parse: parse the content of the new segment;
    updatedb: add new links to the webDB according to the new segment;
}
invertlinks: create the linkdb, listing incoming links for each URL;
index: create indexes for the segments;
dedup: delete duplicate documents from the indexes;
merge: merge all indexes into a single index;
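The order of the steps above can be sketched in Python as a small planner. This is only an illustration of the sequence in the pseudo-code: the names `ROUND_STEPS`, `FINAL_STEPS`, and `crawl_plan` are made up for this sketch, and the arguments each command takes are deliberately left out because they depend on the Nutch version.

```python
from typing import List

# Steps repeated once per crawl round, in the order given by the pseudo-code.
ROUND_STEPS = ["generate", "fetch", "parse", "updatedb"]
# Steps that run once, after the last round.
FINAL_STEPS = ["invertlinks", "index", "dedup", "merge"]


def crawl_plan(depth: int) -> List[str]:
    """Return the ordered command names for a crawl of the given depth."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    plan = ["inject"]  # inject runs once, before the first round
    for _ in range(depth):
        plan.extend(ROUND_STEPS)
    plan.extend(FINAL_STEPS)
    return plan


if __name__ == "__main__":
    # Print the numbered plan for a crawl of depth 2.
    for number, step in enumerate(crawl_plan(2), start=1):
        print(f"{number:2d}. {step}")
```

Each name in the returned list corresponds to one command from the pseudo-code, so a crawl of depth `d` issues `1 + 4 * d + 4` commands in total.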
2. Nutch provides a set of utility commands:
for the webDB:
readdb: read utility
mergedb: merge utility
convdb: converter for old-version webDBs
for the linkdb:
readlinkdb: read utility
mergelinkdb: merge utility
for segments:
readseg: read utility
mergesegs: merge utility
3. Besides those above, there are two system commands:
plugin: the plugin registry
server: a search server
4. Here is a complete list of all commands with a brief description of each:
| command | input | output | task |
|---|---|---|---|
| crawl | urls dir | all | run the whole crawl with a single command |
| inject | urls dir | webDB | pass the links in the urls file to the webDB |
| generate | webDB | a segment | create a new segment and generate a fetchlist from the webDB |
| freegen | urls dir | a segment | create a new segment and generate a fetchlist from a plain text file |
| fetch | a segment | a segment | fetch content from the URLs in the fetchlist |
| fetch2 | a segment | a segment | an alternative fetcher |
| parse | a segment | a segment | parse the content of a segment |
| updatedb | a segment | webDB | add new links to the webDB according to the new segment |
| invertlinks | segments | linkdb | maintain an inverted link map, listing incoming links for each URL |
| index | segments, linkdb, webDB | indexes | create indexes for the segments |
| dedup | indexes dir | indexes dir | delete duplicate documents in a set of Lucene indexes |
| merge | indexes dir | index | merge all indexes into a single index |
| readdb | webDB | information about the webDB | read utility for the webDB |
| mergedb | webDBs | webDB | merge several webDBs |
| readlinkdb | linkdb | information about the linkdb | read utility for the linkdb |
| mergelinkdb | linkdbs | linkdb | merge several linkdbs |
| readseg | segment | information about the segment | read utility for a segment |
| mergesegs | segments | segment | merge several segments |
| convdb | webDB | webDB | convert an old webDB into the new version |
| plugin | plugin class | NA | register a plugin |
| server | port, indexdir | NA | run a search server |
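
The input and output columns of the table describe a data flow, which can be checked with a short Python sketch. It encodes only the crawl-pipeline rows and reports any command whose input has not been produced by an earlier step. The store labels (`"urls"`, `"segment"`, and so on) and the names `PIPELINE` and `missing_inputs` are assumptions made for this sketch, not part of Nutch.

```python
from typing import Dict, Iterable, List, Set, Tuple

Step = Tuple[str, Set[str], Set[str]]  # (command, inputs, outputs)

# Crawl-pipeline rows transcribed from the table; store labels are illustrative.
PIPELINE: List[Step] = [
    ("inject", {"urls"}, {"webDB"}),
    ("generate", {"webDB"}, {"segment"}),
    ("fetch", {"segment"}, {"segment"}),
    ("parse", {"segment"}, {"segment"}),
    ("updatedb", {"segment"}, {"webDB"}),
    ("invertlinks", {"segment"}, {"linkdb"}),
    ("index", {"segment", "linkdb", "webDB"}, {"indexes"}),
    ("dedup", {"indexes"}, {"indexes"}),
    ("merge", {"indexes"}, {"index"}),
]


def missing_inputs(steps: Iterable[Step], initial: Set[str]) -> Dict[str, Set[str]]:
    """Map each command to the inputs that no earlier step has produced."""
    available = set(initial)
    problems: Dict[str, Set[str]] = {}
    for command, inputs, outputs in steps:
        missing = inputs - available
        if missing:
            problems[command] = missing
        # The step's outputs become available to every later step.
        available |= outputs
    return problems


if __name__ == "__main__":
    # Starting from only a urls directory, the full pipeline has no gaps.
    print(missing_inputs(PIPELINE, {"urls"}))
    # Skipping inject leaves generate without a webDB to read from.
    print(missing_inputs(PIPELINE[1:], {"urls"}))
```

Starting with only a urls directory, every step finds its input already produced, which is why the commands must run in the order listed in point 1.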