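Once the crawl has finished, you can deploy to Tomcat and serve searches. The steps are:
1. Install the WAR file. Copy the WAR file $nutch$/nutch-*.war into the directory $tomcat$/webapps/, saving it as nutch.war: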
cp $nutch$/nutch-*.war $tomcat$/webapps/nutch.war
The search home page can then be opened at the URL http://127.0.0.1:8080/nutch
If you save the file as ROOT.war instead, the corresponding URL is http://127.0.0.1:8080
cp $nutch$/nutch-*.war $tomcat$/webapps/ROOT.war
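2. Specify the search data directory. The search webapp has to be told where the data files are.
Assuming the WAR file was saved as nutch.war, restart Tomcat so that it unpacks the WAR into the directory $tomcat$/webapps/nutch/.
Open the file $tomcat$/webapps/nutch/WEB-INF/classes/nutch-site.xml and add the searcher.dir
property. For example, if the data files are stored in /local/nutch/crawl, add: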
<property>
<name>searcher.dir</name>
<value>/local/nutch/crawl</value>
</property>
Now search.jsp knows where the data files are.
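Before restarting Tomcat, it can help to confirm that the file really contains the property you meant to add. The following is only a minimal sketch: get_property is a hypothetical helper, not part of Nutch, and the <configuration> root element in the sample is an assumption (the root element name may differ between Nutch versions, so the function does not depend on it).

```python
import xml.etree.ElementTree as ET
from typing import Optional


def get_property(config_xml: str, name: str) -> Optional[str]:
    """Return the value of the named <property>, or None if it is absent."""
    root = ET.fromstring(config_xml)
    # iter() walks the whole tree, so the root element's name does not matter.
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Sample content; the <configuration> wrapper is an assumption.
SAMPLE = """<configuration>
<property>
<name>searcher.dir</name>
<value>/local/nutch/crawl</value>
</property>
</configuration>"""

print(get_property(SAMPLE, "searcher.dir"))
```

To check a real installation, read the contents of nutch-site.xml into a string and pass it in place of SAMPLE.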
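3. Make Tomcat accept Chinese input. To search with Chinese keywords, Tomcat must decode Chinese input correctly. To do this, edit Tomcat's
configuration file $tomcat$/conf/server.xml and add two attributes to the Connector on port 8080:
URIEncoding and
useBodyEncodingForURI. The result looks like this: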
<Connector port="8080" maxHttpHeaderSize="8192"
maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
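To see why the URIEncoding attribute matters, here is a small illustration of what happens when a UTF-8 percent-encoded query is decoded with the wrong charset. It is a sketch of the general encoding problem, not of Tomcat's internals, and it assumes the browser sends the query as UTF-8 and that the server would otherwise fall back to ISO-8859-1 (the default in older Tomcat releases, as far as I know).

```python
from urllib.parse import quote, unquote

query = "中文"  # a Chinese search keyword

# What a browser that encodes URLs as UTF-8 would put in the query string.
encoded = quote(query, encoding="utf-8")

# Decoding with the matching charset recovers the keyword.
as_utf8 = unquote(encoded, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 yields mojibake instead.
as_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(as_utf8 == query)
print(as_latin1 == query)
```

With the wrong charset the search code receives six meaningless characters rather than the two-character keyword, so nothing matches.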
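4. If you want to search large sites such as web portals, a few more settings need to be changed, because the defaults are tuned for intranets.
Change db.max.outlinks.per.page, which sets the maximum number of outlinks processed for a page; any links beyond that number are ignored. The default is 100; raising it to 1000 should be enough.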
<property>
<name>db.max.outlinks.per.page</name>
<value>1000</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
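The rule in the description above can be sketched in a few lines. This is only an illustration of the stated semantics, not Nutch's actual code: limit_outlinks is a hypothetical name, and keeping the first links in page order is my assumption (the description only says "at most" that many are processed).

```python
from typing import List


def limit_outlinks(outlinks: List[str], max_per_page: int) -> List[str]:
    """Apply the db.max.outlinks.per.page rule to one page's outlinks."""
    if max_per_page >= 0:
        # Nonnegative limit: keep at most max_per_page links
        # (assumed here to be the first ones found on the page).
        return outlinks[:max_per_page]
    # Negative limit: process every outlink.
    return list(outlinks)


# A portal-style page with 250 links (hypothetical URLs).
links = [f"http://www.example.com/page{i}.html" for i in range(250)]

print(len(limit_outlinks(links, 100)))   # default setting
print(len(limit_outlinks(links, 1000)))  # setting suggested above
print(len(limit_outlinks(links, -1)))    # no limit
```

With the default of 100, more than half of this page's links would never be followed, which is why the limit matters for link-heavy portal pages.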
Change urlfilter.order, which specifies the order in which the URL filters are applied. I prefer regular expressions, so I set it to org.apache.nutch.urlfilter.regex.RegexURLFilter.
<property>
<name>urlfilter.order</name>
<value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
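The description's point that ordering affects cost but not the outcome can be checked with a toy filter chain. This is a sketch under assumptions: both filters and their rules are hypothetical stand-ins, and each one returns the URL to accept it or None to reject it, which is how I understand the filter contract.

```python
import re
from typing import Callable, List, Optional

UrlFilter = Callable[[str], Optional[str]]


def regex_filter(url: str) -> Optional[str]:
    """Hypothetical rule: reject URLs that contain query-like characters."""
    return None if re.search(r"[?*!@=]", url) else url


def prefix_filter(url: str) -> Optional[str]:
    """Hypothetical rule: accept only URLs under one site."""
    return url if url.startswith("http://www.example.com/") else None


def apply_filters(url: str, filters: List[UrlFilter]) -> Optional[str]:
    """Run the filters in order; the first rejection ends the chain (AND)."""
    current: Optional[str] = url
    for url_filter in filters:
        current = url_filter(current)
        if current is None:
            return None
    return current


urls = [
    "http://www.example.com/a.html",
    "http://www.example.com/search?q=1",
    "http://other.example.org/",
]

regex_first = [apply_filters(u, [regex_filter, prefix_filter]) for u in urls]
prefix_first = [apply_filters(u, [prefix_filter, regex_filter]) for u in urls]
print(regex_first == prefix_first)
```

Both orderings accept and reject the same URLs; only the amount of work done before a rejection differs, so putting the cheaper filter first can save time.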
5. Restart Tomcat once more and open the URL http://127.0.0.1:8080/nutch in a browser. That's it; now you can enjoy Nutch.