【Nutch 2.3 Basic Tutorial】Integrating Nutch/Hadoop/HBase/Solr to Build a Search Engine: Installation and Running【Cluster Environment】


1. Download the required software and unpack it

The versions are as follows:

(1) apache-nutch-2.3

(2) hadoop-1.2.1

(3) hbase-0.92.1

(4) solr-4.9.0

Unpack them all into /opt/jediael.

To get the latest development version of Nutch instead, check it out from SVN:

 svn co https://svn.apache.org/repos/asf/nutch/branches/2.x

2. Set up a Hadoop 1.2.1 cluster

See http://blog.csdn.net/jediael_lu/article/details/38926477

3. Set up an HBase 0.92.1 cluster

See http://blog.csdn.net/jediael_lu/article/details/43086641


4. Configure Nutch

(1) vi /usr/search/apache-nutch-2.3/conf/nutch-site.xml

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>

Note that http.agent.name must be set; Nutch refuses to start a crawl if the agent name is empty.

(2) vi /usr/search/apache-nutch-2.3/ivy/ivy.xml

The following line is commented out by default; remove the comment markers to enable it:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

gora-hbase 0.5 corresponds to HBase 0.94.12.

Adjust the Hadoop version as needed:

<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.1" conf="test->default" />

(3) vi /usr/search/apache-nutch-2.3/conf/gora.properties

Add the following line:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Together, these three steps configure Nutch to use HBase as its storage backend.
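As an illustrative sketch, the change from step (3) can be applied and verified from the shell. Note this operates on a stand-in gora.properties in the current directory, not the real conf file:

```shell
# Append the Gora default datastore setting to a stand-in gora.properties
# (in a real setup this is /usr/search/apache-nutch-2.3/conf/gora.properties).
echo 'gora.datastore.default=org.apache.gora.hbase.store.HBaseStore' >> gora.properties

# Confirm the setting is present.
grep '^gora.datastore.default=' gora.properties
```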


(4) Adjust the URL filter as needed

 vi /usr/search/apache-nutch-2.3/conf/regex-urlfilter.txt

Change

# accept anything else
+.

to

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
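As a quick sanity check, the effect of this accept rule can be approximated with grep -E. Nutch actually applies Java regular expressions, and strictly speaking the dots in the host name should be escaped as \., so treat this only as an illustration:

```shell
# Approximate the regex-urlfilter.txt accept rule with grep -E.
pattern='^http://([a-z0-9]*\.)*nutch\.apache\.org/'

for url in "http://nutch.apache.org/" \
           "http://wiki.nutch.apache.org/index.html" \
           "http://www.example.com/"; do
  if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "ACCEPT $url"
  else
    echo "REJECT $url"
  fi
done
```

The first two URLs are accepted (any chain of subdomains under nutch.apache.org matches); the last is rejected.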



(9) Index additional fields

By default, only the fields handled by the core and index-basic plugins referenced in schema.xml are indexed. To index more fields, extend the plugin list as follows.

Edit nutch-default.xml and extend plugin.includes as shown below (at minimum, add index-more):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Alternatively, add a plugin.includes property to nutch-site.xml and copy the value above into it. Note that a property defined in nutch-site.xml replaces (rather than extends) the one in nutch-default.xml, so the original value must be copied over in full.
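For example, the override in nutch-site.xml would look like this. The value is the full list from above with index-more already included; treat it as a sketch and paste the actual value from your nutch-default.xml:

```xml
<!-- nutch-site.xml: this value REPLACES the one in nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
</property>
```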




(5) Build the runtime

 cd /usr/search/apache-nutch-2.3/

ant runtime


(6) Verify the Nutch installation

# cd /usr/search/apache-nutch-2.3/runtime/local/bin/
# ./nutch 
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


(7) Create seed.txt

 cd /usr/search/apache-nutch-2.3/runtime/deploy/bin/

vi seed.txt

http://nutch.apache.org/

hadoop fs -copyFromLocal seed.txt /

This places seed.txt in the root directory of HDFS.
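The steps above can be sketched non-interactively; the HDFS copy is left commented out because it needs a running Hadoop cluster:

```shell
# Create the seed list; one URL per line.
printf 'http://nutch.apache.org/\n' > seed.txt
cat seed.txt

# Copy it to the HDFS root (requires a running Hadoop cluster):
# hadoop fs -copyFromLocal seed.txt /
```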


(8) While the crawl is running, the following exception may appear:

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat

The root cause is unclear. To let the crawl continue, comment out the following lines in the crawl script:

#echo "SOLR dedup -> $SOLRURL"
    #__bin_nutch solrdedup $commonOptions $SOLRURL

The cause can be investigated later. Setting export CLASSPATH=$CLASSPATH:..... had no effect.

Running in local mode, however, does not trigger this error.
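A hypothetical way to script this workaround with sed; for illustration it operates on a stand-in fragment of the two lines rather than the real runtime/deploy/bin/crawl script, and the GNU form of sed -i is assumed:

```shell
# Reproduce the two offending lines in a stand-in file.
cat > crawl-fragment <<'EOF'
echo "SOLR dedup -> $SOLRURL"
    __bin_nutch solrdedup $commonOptions $SOLRURL
EOF

# Prefix each of the two lines with '#' to disable the dedup step.
sed -i -e '/SOLR dedup/s/^/#/' -e '/solrdedup/s/^/#/' crawl-fragment
cat crawl-fragment
```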



5. Configure Solr

(1) Overwrite Solr's schema.xml with the one shipped by Nutch. (For Solr 4, schema-solr4.xml should be used.)

cp /usr/search/apache-nutch-2.3/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration would be complete at this point, but 4.9 needs the following changes [no longer necessary in recent versions]:

In the copied schema.xml,

delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

add: <field name="_version_" type="long" indexed="true" stored="true"/>

Alternatively, run Solr under Tomcat; see http://blog.csdn.net/jediael_lu/article/details/37908885.


6. Start the crawl

(1) Start Hadoop

# start-all.sh

(2) Start HBase

# ./start-hbase.sh

(3) Start Solr

# cd /usr/search/solr-4.9.0/example/
# java -jar start.jar

(4) Start Nutch and begin crawling

With seed.txt already copied to the HDFS root directory, run:

# cd /usr/search/apache-nutch-2.3/runtime/deploy
# bin/crawl /seed.txt TestCrawl http://localhost:8983/solr 2

The four arguments are the seed directory on HDFS, a crawl ID, the Solr URL, and the number of crawl rounds.

Done; the crawl job starts running.


7. Exceptions that may occur during installation

Exception 1: No active index writer.

Edit nutch-default.xml and add indexer-solr to plugin.includes.

Exception 2: ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat

In SolrDeleteDuplicates, after the line

Job job = new Job(getConf(), "solrdedup");

add the following code:

job.setJarByClass(SolrDeleteDuplicates.class);

(Without setJarByClass, Hadoop cannot determine which jar to ship to the cluster, which is why local mode is unaffected.)



For further analysis of the process above, see:

Integrating Nutch/HBase/Solr to Build a Search Engine, Part 2: Content Analysis

http://blog.csdn.net/jediael_lu/article/details/37738569


When scheduling the Nutch crawl with crontab, the following errors appeared:

JAVA_HOME is not set.

and

Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.

So a wrapper script was created to run the crawl:

$ vi /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh

#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
export PATH=$PATH:/opt/jediael/hadoop-1.2.1/bin/
/opt/jediael/apache-nutch-2.3/runtime/deploy/bin/crawl /seed.txt `date +%h%d%H` http://master:8983/solr/ 2

Then schedule it:

0 0,9,12,15,19,21 * * * bash /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh >> ~/nutch.log
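The backtick expression in myCrawl.sh generates a fresh crawl ID per run from the current date: %h is the abbreviated month name, %d the day of month, %H the hour, so the ID looks like Feb0509 for Feb 5 at 09:00. A quick check, assuming the C locale for English month names:

```shell
# Generate a crawl ID the same way myCrawl.sh does.
CRAWL_ID=$(LC_ALL=C date +%h%d%H)
echo "$CRAWL_ID"

# It should follow the Mon+DD+HH pattern, e.g. Feb0509.
case "$CRAWL_ID" in
  [A-Z][a-z][a-z][0-9][0-9][0-9][0-9]) echo "format OK" ;;
  *) echo "unexpected format" ;;
esac
```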