Stock Nutch handles Chinese by splitting text per character rather than per word. To match how Chinese is actually searched, word segmentation has to be added; I used IKAnalyzer. Below is my approach. I implemented Chinese segmentation in two different ways for the front end and the back end: the back-end crawler replaces Nutch's analyzer directly, while the front end modifies NutchAnalysis.jj (note: my front end and back end are two separate projects).
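To see what per-character versus per-word splitting means, here is a toy, self-contained sketch. The class name and the three-entry dictionary are invented for illustration, and IK's real algorithm and dictionary are far more elaborate; this only contrasts single-character splitting with dictionary-based forward maximum matching, the general family of techniques dictionary segmenters like IK build on:

```java
import java.util.*;

public class SegmentDemo {
    // Hypothetical toy dictionary; a real segmenter ships a large one.
    static final Set<String> DICT = new HashSet<>(
            Arrays.asList("中文", "分词", "中文分词"));

    // Per-character splitting: what stock Nutch does with <SIGRAM: <CJK>>.
    static List<String> perChar(String text) {
        List<String> out = new ArrayList<>();
        for (char c : text.toCharArray()) out.add(String.valueOf(c));
        return out;
    }

    // Greedy forward maximum matching: at each position, take the longest
    // dictionary word; fall back to a single character if nothing matches.
    static List<String> maxMatch(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // single-character fallback
            for (int j = text.length(); j > i; j--) {
                if (DICT.contains(text.substring(i, j))) { end = j; break; }
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(perChar("中文分词"));  // [中, 文, 分, 词]
        System.out.println(maxMatch("中文分词")); // [中文分词]
    }
}
```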

Front-end changes:

Find NutchAnalysis.jj under the src/java/org/apache/nutch/analysis package.

(1) In the PARSER_BEGIN(NutchAnalysis) section, add the following to the import declarations (the action code below also needs java.io.StringReader, java.io.IOException, and the Lucene TokenStream/attribute classes, if they are not already imported):

import org.wltea.analyzer.lucene.IKTokenizer;

(2) Below TOKEN_MGR_DECLS : { add the following declarations:

TermAttribute termAtt = null;   // current term produced by the IK segmenter
OffsetAttribute offAtt = null;  // start/end offsets of the current term
TokenStream stream = null;      // IK token stream for the current CJK run

private int cjkStartOffset = 0; // start position of the CJK run in the input

(3) In the TOKEN : { section, find | <SIGRAM: <CJK> >. This is the per-character rule; change it to | <SIGRAM: (<CJK>)+ >

and append the following action block after it:

{
    if (stream == null) {
        // First match of this CJK run: hand the whole run to IK
        // (true = maximum-word-length mode).
        stream = new IKTokenizer(new StringReader(image.toString()), true);
        cjkStartOffset = matchedToken.beginColumn;
        try {
            stream.reset();
        } catch (IOException e) {
            e.printStackTrace();
        }
        termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
        offAtt = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
        try {
            if (!stream.incrementToken())
                termAtt = null;
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    if (termAtt != null && !termAtt.term().equals("")) {
        // Replace the matched token with the current IK term
        // and map its offsets back into the original input.
        matchedToken.image = termAtt.term();
        matchedToken.beginColumn = cjkStartOffset + offAtt.startOffset();
        matchedToken.endColumn = cjkStartOffset + offAtt.endOffset();
        try {
            // If IK has more terms, back up one character so this
            // SIGRAM rule fires again for the same run.
            if (stream.incrementToken())
                input_stream.backup(1);
            else
                termAtt = null;
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    if (termAtt == null || termAtt.term().equals("")) {
        // The run is exhausted; reset state for the next CJK run.
        stream = null;
        cjkStartOffset = 0;
    }
}
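The flow of that action block is easier to see in isolation. Below is a self-contained sketch in plain Java (all names are invented; an iterator over pre-segmented words stands in for the IKTokenizer stream) of the same pattern: the first firing of the rule opens the sub-token stream, each firing emits one word, and input_stream.backup(1) is simulated so the rule keeps firing until the run is exhausted:

```java
import java.util.*;

public class SigramActionDemo {
    private Iterator<String> stream = null; // stand-in for the IK TokenStream
    private String term = null;             // stand-in for termAtt
    private boolean needBackup = false;     // stand-in for input_stream.backup(1)

    // Mirrors the SIGRAM action: one word emitted per firing of the rule.
    String emit(List<String> ikOutput) {
        if (stream == null) {                     // first firing for this run,
            stream = ikOutput.iterator();         // i.e. "new IKTokenizer(...)"
            term = stream.hasNext() ? stream.next() : null;
        }
        String out = null;
        if (term != null && !term.isEmpty()) {
            out = term;                           // matchedToken.image = term
            if (stream.hasNext()) {               // more words to come:
                term = stream.next();             //   advance, and
                needBackup = true;                //   re-fire the rule
            } else {
                term = null;
                needBackup = false;
            }
        }
        if (term == null) stream = null;          // run exhausted: reset state
        return out;
    }

    // Drives emit() the way the generated lexer would, honoring the backup.
    List<String> tokenize(List<String> ikOutput) {
        List<String> out = new ArrayList<>();
        do {
            String t = emit(ikOutput);
            if (t != null) out.add(t);
        } while (needBackup);
        return out;
    }

    public static void main(String[] args) {
        SigramActionDemo demo = new SigramActionDemo();
        System.out.println(demo.tokenize(Arrays.asList("中文", "分词", "很好")));
        // [中文, 分词, 很好]
    }
}
```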

(4) Regenerate the parser from NutchAnalysis.jj with the javacc tool (javacc NutchAnalysis.jj) and copy all of the generated .java files over the existing ones in src/java/org/apache/nutch/analysis.
       If the compiler then complains about unhandled checked exceptions, just declare them with throws.

Back-end crawler changes:

Modify NutchDocumentAnalyzer in the src/java/org/apache/nutch/analysis package.

After private static Analyzer ANCHOR_ANALYZER; add:

private static Analyzer MY_ANALYZER;

After ANCHOR_ANALYZER = new AnchorAnalyzer(); add (you will also need to import org.wltea.analyzer.lucene.IKAnalyzer):

 MY_ANALYZER = new IKAnalyzer();

Change tokenStream to:

public TokenStream tokenStream(String fieldName, Reader reader) {
    Analyzer analyzer;
    // Bypass Nutch's original per-field dispatch:
    // if ("anchor".equals(fieldName))
    //     analyzer = ANCHOR_ANALYZER;
    // else
    //     analyzer = CONTENT_ANALYZER;
    analyzer = MY_ANALYZER;

    return analyzer.tokenStream(fieldName, reader);
}
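One consequence worth noting: with this change every field, including "anchor", is now analyzed by IK rather than by Nutch's per-field analyzers. A minimal stand-in for the before/after routing (plain Java; the tagging functions are invented purely to make the dispatch visible):

```java
import java.util.function.Function;

public class DispatchDemo {
    // Stand-ins for the analyzers; each just tags its input with its name.
    static final Function<String, String> ANCHOR_ANALYZER  = s -> "anchor:" + s;
    static final Function<String, String> CONTENT_ANALYZER = s -> "content:" + s;
    static final Function<String, String> MY_ANALYZER      = s -> "ik:" + s;

    // Nutch's original per-field dispatch.
    static Function<String, String> before(String fieldName) {
        return "anchor".equals(fieldName) ? ANCHOR_ANALYZER : CONTENT_ANALYZER;
    }

    // The modified tokenStream: one analyzer regardless of field.
    static Function<String, String> after(String fieldName) {
        return MY_ANALYZER;
    }

    public static void main(String[] args) {
        System.out.println(before("anchor").apply("x")); // anchor:x
        System.out.println(after("anchor").apply("x"));  // ik:x
        System.out.println(after("content").apply("x")); // ik:x
    }
}
```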