If you seek peace of mind and happiness, then believe in God! If you wish to be a disciple of truth, then search!! -- Nietzsche
I can calculate the motions of heavenly bodies, but not the madness of people. -- Newton
You have to be out to be in.

Search Engines

Java, Web, Search Engine

Once the crawl has finished, you can deploy Nutch to Tomcat and serve searches. The steps are as follows:

1. Install the WAR file
   Copy the WAR file $nutch$/nutch-*.war into the directory $tomcat$/webapps/:
   cp $nutch$/nutch-*.war $tomcat$/webapps/nutch.war
   The search home page is then available at the URL http://127.0.0.1:8080/nutch

   If you save it as ROOT.war instead, the corresponding URL is http://127.0.0.1:8080
   cp $nutch$/nutch-*.war $tomcat$/webapps/ROOT.war
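
The copy step can be wrapped in a small guard so that it fails loudly when the glob matches no WAR file or more than one. This is only a sketch: deploy_war is a hypothetical helper, not part of Nutch or Tomcat, and its arguments stand in for the $nutch$ and $tomcat$ placeholders used above.

```shell
# deploy_war <nutch_dir> <tomcat_dir> [context]
# Copies the single nutch-*.war found in <nutch_dir> to
# <tomcat_dir>/webapps/<context>.war (context defaults to "nutch";
# use "ROOT" to serve the application at the server root).
deploy_war() {
  nutch_dir=$1
  tomcat_dir=$2
  context=${3:-nutch}
  # Expand the glob into the positional parameters so matches can be counted.
  set -- "$nutch_dir"/nutch-*.war
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one nutch-*.war in $nutch_dir" >&2
    return 1
  fi
  cp "$1" "$tomcat_dir/webapps/$context.war"
}
```

For example, deploy_war /path/to/nutch /path/to/tomcat ROOT would produce webapps/ROOT.war; both paths here are placeholders.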

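2. Specify the search data directory
   The search web application needs to be told where the data files are.
   Assuming the WAR file was saved as nutch.war, restart Tomcat so that it unpacks the archive into the directory $tomcat$/webapps/nutch/.
   Open the file $tomcat$/webapps/nutch/WEB-INF/classes/nutch-site.xml and add the searcher.dir
   property. For example, if the data files are stored in /local/nutch/crawl, add:
   <property>
      <name>searcher.dir</name>
      <value>/local/nutch/crawl</value>
   </property>
   Now search.jsp knows where the data files are.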
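
To double-check the setting from the shell, a small helper can read the value back out of nutch-site.xml. This is a rough sketch, not a real XML parser: get_property is a hypothetical helper, and it assumes <name> and <value> are each written on their own line, as in the snippet above.

```shell
# get_property <site-xml-file> <property-name>
# Prints the <value> that follows the matching <name> element.
# Prints nothing if the property is not found.
get_property() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && match($0, /<value>.*<\/value>/) {
      # Strip the 7-character <value> and 8-character </value> tags.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Example use (paths are placeholders, so it is left commented out):
# dir=$(get_property /path/to/tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml searcher.dir)
# [ -d "$dir" ] || echo "searcher.dir does not exist: $dir" >&2
```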

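3. Enable Chinese input in Tomcat
   To search with Chinese keywords, Tomcat must be able to decode Chinese input. To do this, edit Tomcat's
   configuration file $tomcat$/conf/server.xml and add two attributes, URIEncoding
   and useBodyEncodingForURI, to the Connector on port 8080, as follows: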
    <Connector port="8080" maxHttpHeaderSize="8192"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true"
               URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
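
With URIEncoding="UTF-8", Tomcat decodes percent-encoded bytes in the request URI as UTF-8, so a Chinese keyword has to arrive as its UTF-8 bytes. The sketch below shows what such a request URL looks like. Both functions are hypothetical helpers, and the query parameter name for search.jsp is an assumption to check against your Nutch version.

```shell
# urlencode <string>
# Percent-encodes every byte of the argument (ASCII included, which is
# verbose but valid), so multi-byte UTF-8 characters survive in a URL.
urlencode() {
  printf '%s' "$1" | od -An -v -tx1 | { tr -d ' \t\n'; echo; } \
    | sed 's/../%&/g' | tr 'a-f' 'A-F'
}

# search_url <base-url> <keyword>
# Builds a search URL; the "query" parameter name is an assumption.
search_url() {
  printf '%s/search.jsp?query=%s\n' "$1" "$(urlencode "$2")"
}

# Example (needs a running Tomcat, so it is left commented out):
# curl -s "$(search_url http://127.0.0.1:8080/nutch 搜索)"
```

When the script is saved as UTF-8, the keyword 搜索 encodes to %E6%90%9C%E7%B4%A2.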

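4. To search large sites such as web portals, a few more settings need to be changed, because the default configuration is meant for intranet search.
   Change db.max.outlinks.per.page, which sets the maximum number of links processed per page; any links beyond this number are ignored. The default is 100; raising it to 1000 is enough.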
<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
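
The rule in the description above can be illustrated with a short filter. This is only a sketch of the stated semantics, not how Nutch implements it internally: limit_outlinks is a hypothetical helper that reads one link per line from standard input.

```shell
# limit_outlinks <max>
# Passes through at most <max> lines when <max> is nonnegative,
# and every line when <max> is negative.
limit_outlinks() {
  awk -v max="$1" 'max < 0 || NR <= max'
}

# Example: keep the first 1000 links listed in a file (placeholder name).
# limit_outlinks 1000 < links.txt
```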

   Change urlfilter.order, which specifies the order in which URL filters are applied. I prefer regular expressions, so I set it to org.apache.nutch.urlfilter.regex.RegexURLFilter.
<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>


5. Restart Tomcat again

   Open the URL http://127.0.0.1:8080/nutch in a browser. That's it: now enjoy Nutch.


posted on 2007-10-04 23:01 by 专心练剑 | Category: Search Engines
