
Koala++'s blog

Nutch 1.0 Source Code Analysis [10]: Indexer

2010-03-24 18:46:08 | Category: Nutch

After fetching and parsing are done, the next step in Crawl is building the index:

indexer.index(indexes, crawlDb, linkDb,
    Arrays.asList(HadoopFSUtil.getPaths(fstats)));
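
For context, fstats is the result of listing the segments directory in Crawl.java. A sketch of how the arguments are produced, paraphrased from memory rather than quoted verbatim (the directory-filter call is an assumption):

// List one FileStatus per segment directory (the filter skips plain files).
FileStatus[] fstats = fs.listStatus(segments,
    HadoopFSUtil.getPassDirectoriesFilter(fs));
// HadoopFSUtil.getPaths(fstats) converts FileStatus[] to Path[], and
// Arrays.asList() wraps it as the List<Path> that index() expects.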

The index method is as follows:

public void index(Path luceneDir, Path crawlDb, Path linkDb,
    List<Path> segments) throws IOException {
  final JobConf job = new NutchJob(getConf());

  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);

  FileOutputFormat.setOutputPath(job, luceneDir);

  LuceneWriter.addFieldOptions("segment", LuceneWriter.STORE.YES,
      LuceneWriter.INDEX.NO, job);
  LuceneWriter.addFieldOptions("digest", LuceneWriter.STORE.YES,
      LuceneWriter.INDEX.NO, job);
  LuceneWriter.addFieldOptions("boost", LuceneWriter.STORE.YES,
      LuceneWriter.INDEX.NO, job);

  NutchIndexWriterFactory.addClassToConf(job, LuceneWriter.class);

  JobClient.runJob(job);
}

Here we can already see some traces of Lucene: each field has a name, a store option, and an index option.
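
To make these options concrete, here is roughly what STORE.YES plus INDEX.NO means in plain Lucene 2.x terms (the Lucene version Nutch 1.0 ships with); LuceneWriter performs this translation internally, and the segment name below is a made-up example:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// A stored-but-not-indexed field: its raw value can be read back from a
// search hit, but it cannot be searched on.
Document doc = new Document();
doc.add(new Field("segment", "20100324184608",
    Field.Store.YES, Field.Index.NO));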

The code of initMRJob is as follows:

public static void initMRJob(Path crawlDb, Path linkDb,
    Collection<Path> segments, JobConf job) {
  for (final Path segment : segments) {
    LOG.info("IndexerMapReduces: adding segment: " + segment);
    FileInputFormat.addInputPath(job, new Path(segment,
        CrawlDatum.FETCH_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment,
        CrawlDatum.PARSE_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
  }

  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
  FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);

  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class);
  job.setOutputValueClass(NutchWritable.class);
}

For every segment, the crawl_fetch, crawl_parse, parse_data and parse_text directories are added as input paths, and the crawldb's current directory and the linkdb's current directory are added as well. Both the map and the reduce functions are implemented in this same class, IndexerMapReduce.
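
As a reminder, the on-disk layout these constants resolve to looks like this (the segment's timestamp name is just an example; segments also contain content and crawl_generate directories, which are not used here):

crawl/
├── crawldb/current/              CrawlDb.CURRENT_NAME
├── linkdb/current/               LinkDb.CURRENT_NAME
└── segments/20100324184608/
    ├── crawl_fetch/              CrawlDatum.FETCH_DIR_NAME
    ├── crawl_parse/              CrawlDatum.PARSE_DIR_NAME
    ├── parse_data/               ParseData.DIR_NAME
    └── parse_text/               ParseText.DIR_NAME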

The map function simply collects all the data belonging to the same URL together:

public void map(Text key, Writable value,
    OutputCollector<Text, NutchWritable> output, Reporter reporter)
    throws IOException {
  output.collect(key, new NutchWritable(value));
}
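
The only subtlety here is NutchWritable: the inputs are SequenceFiles holding four different value classes, but Hadoop requires a single map-output value class, so each value is wrapped. NutchWritable builds on Hadoop's GenericWritable; a minimal sketch of the idea (the real class registers more types than shown, and the class name here is made up):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;

// GenericWritable serializes a small type id ahead of the wrapped value,
// so Writables of different classes can travel through one shuffle and
// be recovered with get() on the reduce side.
public class NutchWritableSketch extends GenericWritable {

  @SuppressWarnings("unchecked")
  private static final Class<? extends Writable>[] TYPES = new Class[] {
      CrawlDatum.class,
      Inlinks.class,
      ParseData.class,
      ParseText.class,
  };

  @Override
  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
}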

The reduce function is best taken apart piece by piece:

Inlinks inlinks = null;
CrawlDatum dbDatum = null;
CrawlDatum fetchDatum = null;
ParseData parseData = null;
ParseText parseText = null;
while (values.hasNext()) {
  final Writable value = values.next().get(); // unwrap
  if (value instanceof Inlinks) {
    inlinks = (Inlinks) value;
  } else if (value instanceof CrawlDatum) {
    final CrawlDatum datum = (CrawlDatum) value;
    if (CrawlDatum.hasDbStatus(datum))
      dbDatum = datum;
    else if (CrawlDatum.hasFetchStatus(datum)) {
      // don't index unmodified (empty) pages
      if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
        fetchDatum = datum;
    } else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
        || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
      continue;
    }
  } else if (value instanceof ParseData) {
    parseData = (ParseData) value;
  } else if (value instanceof ParseText) {
    parseText = (ParseText) value;
  }
}

The loop dispatches on each value's runtime type. A CrawlDatum can be either the dbDatum or the fetchDatum: the entry read from the crawldb's current directory carries a STATUS_DB_* status, while the one read from the segment's crawl_fetch directory carries a STATUS_FETCH_* status, which is exactly why both were added as inputs in initMRJob. (In the full source, reduce returns early at this point if any of dbDatum, fetchDatum, parseData or parseText is still null, so the code below can dereference them safely.)

NutchDocument doc = new NutchDocument();
final Metadata metadata = parseData.getContentMeta();

// add segment, used to map from merged index back to segment files
doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

// add digest, used by dedup
doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

final Parse parse = new ParseImpl(parseText, parseData);
try {
  // extract information from dbDatum and pass it to
  // fetchDatum so that indexing filters can use it
  final Text url = (Text) dbDatum.getMetaData().get(
      Nutch.WRITABLE_REPR_URL_KEY);
  if (url != null) {
    fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
  }
  // run indexing filters
  doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
} catch (final IndexingException e) {
  // (catch restored; it was elided in the excerpt) on indexing-filter
  // error, log and skip this document
  if (LOG.isWarnEnabled()) {
    LOG.warn("Error indexing " + key + ": " + e);
  }
  return;
}

 

float boost = 1.0f;
// run scoring filters
try {
  boost = this.scfilters.indexerScore(key, doc, dbDatum, fetchDatum,
      parse, inlinks, boost);
} catch (final ScoringFilterException e) {
  // (catch restored; it was elided in the excerpt) on scoring error,
  // log and skip this document
  if (LOG.isWarnEnabled()) {
    LOG.warn("Error calculating score " + key + ": " + e);
  }
  return;
}

// apply boost to all indexed fields.
doc.setScore(boost);
// store boost for use by explain and dedup
doc.add("boost", Float.toString(boost));

output.collect(key, doc);

The "segment" field holds the segment name saved in the content metadata under SEGMENT_NAME_KEY, and the "digest" field holds the page signature saved under SIGNATURE_KEY, which is later used for deduplication. The representation URL is pulled out of dbDatum and copied into fetchDatum, the document is run through the indexing filters, its boost is computed by the scoring filters, and finally the reducer emits the URL as the key and the NutchDocument as the value.
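
Note that segment, digest and boost are the only fields added by IndexerMapReduce itself; everything searchable comes from the indexing-filter plugins. With the default index-basic plugin enabled, a finished document might look roughly like this (field values are hypothetical, and the exact field set depends on the configured plugins):

url     = http://example.com/          added by index-basic (assumed default)
host    = example.com
title   = Example Domain
content = ...full parse text...
segment = 20100324184608               added above; stored, not indexed
digest  = 0a1b2c3d4e5f6a7b...          page signature, used by dedup
boost   = 1.0                          stored for explain and dedup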
