Nutch 1.0 Source Code Analysis [8]: ParseSegment

2010-03-24 18:42:47 | Category: Nutch

In Crawl, the code related to parsing is:

if (!Fetcher.isParsing(job)) {
    parseSegment.parse(segment); // parse it, if needed
}

That is, if parsing was not already done during fetching, the segment is parsed here. The parse function is:

public void parse(Path segment) throws IOException {
    JobConf job = new NutchJob(getConf());
    job.setJobName("parse " + segment);

    // input: the content/ directory written by Fetcher
    FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
    job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(ParseSegment.class);
    job.setReducerClass(ParseSegment.class);

    // output: written back into the segment directory via ParseOutputFormat
    FileOutputFormat.setOutputPath(job, segment);
    job.setOutputFormat(ParseOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(ParseImpl.class);

    JobClient.runJob(job);
}

The input directory is the content directory saved by Fetcher, and both the map and reduce functions are implemented in ParseSegment itself.

public void map(WritableComparable key, Content content,
        OutputCollector<Text, ParseImpl> output, Reporter reporter)
        throws IOException {

    ParseResult parseResult = null;
    try {
        parseResult = new ParseUtil(getConf()).parse(content);
    } catch (Exception e) {
        return; // error handling elided in this excerpt
    }

    for (Entry<Text, Parse> entry : parseResult) {
        Text url = entry.getKey();
        Parse parse = entry.getValue();
        ParseStatus parseStatus = parse.getData().getStatus();
        // (handling of an unsuccessful parseStatus elided in this excerpt)

        // pass segment name to parse data
        parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
                getConf().get(Nutch.SEGMENT_NAME_KEY));

        // compute the new signature
        byte[] signature = SignatureFactory.getSignature(getConf())
                .calculate(content, parse);
        parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
                StringUtil.toHexString(signature));

        try {
            scfilters.passScoreAfterParsing(url, content, parse);
        } catch (Exception e) {
            // error handling elided in this excerpt
        }
        output.collect(url, new ParseImpl(new ParseText(parse.getText()),
                parse.getData(), parse.isCanonical()));
    }
}

The process here is not very different from what happens in Fetcher: the signature is computed, and then the post-parsing score is computed. The reason this logic appears in both places is the fetcher.parse property: it "decides whether pages are parsed while they are being fetched, or only after all pages have been fetched" [若冰].
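A minimal sketch of what Fetcher.isParsing presumably checks is shown below; the property name comes from the quote above, but the exact body and the default value are assumptions, not copied from the Nutch source:

public static boolean isParsing(Configuration conf) {
    // Sketch only: assumed to simply read the "fetcher.parse" property;
    // the default value here is an assumption.
    return conf.getBoolean("fetcher.parse", true);
}

When this returns true, Fetcher parses each page right after fetching it and Crawl skips ParseSegment; when it returns false, parsing is deferred to the separate job described here.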

The reduce function is as follows:

public void reduce(Text key, Iterator<Writable> values,
        OutputCollector<Text, Writable> output, Reporter reporter)
        throws IOException {
    output.collect(key, (Writable) values.next()); // collect first value
}

It simply keeps the first value for each key, so any duplicate parses of the same URL within the segment are dropped.

The output format of the parse job is ParseOutputFormat. In its getRecordWriter function:

Path out = FileOutputFormat.getOutputPath(job);

Path text = new Path(new Path(out, ParseText.DIR_NAME), name);
Path data = new Path(new Path(out, ParseData.DIR_NAME), name);
Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name);

final MapFile.Writer textOut = new MapFile.Writer(job, fs, text.toString(),
        Text.class, ParseText.class, CompressionType.RECORD, progress);

final MapFile.Writer dataOut = new MapFile.Writer(job, fs, data.toString(),
        Text.class, ParseData.class, compType, progress);

final SequenceFile.Writer crawlOut = SequenceFile.createWriter(fs, job,
        crawl, Text.class, CrawlDatum.class, compType, progress);

As we can see, the output is split into three directories: parse_text, parse_data, and crawl_parse.
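Putting these three writers together with the directories produced earlier in the crawl, a segment after this job roughly looks as follows (a sketch for orientation only; crawl_generate and crawl_fetch are written by Generator and Fetcher, not by this job):

segment/
    crawl_generate/   written by Generator
    crawl_fetch/      written by Fetcher
    content/          written by Fetcher, input of this job
    parse_text/       textOut (MapFile of ParseText)
    parse_data/       dataOut (MapFile of ParseData)
    crawl_parse/      crawlOut (SequenceFile of CrawlDatum)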

Writing the parsed text is straightforward:

textOut.append(key, new ParseText(parse.getText()));

Let's break the rest of the code into pieces:

ParseData parseData = parse.getData();
// recover the signature prepared by Fetcher or ParseSegment
String sig = parseData.getContentMeta().get(Nutch.SIGNATURE_KEY);
if (sig != null) {
    byte[] signature = StringUtil.fromHexString(sig);
    if (signature != null) {
        // append a CrawlDatum with a signature
        CrawlDatum d = new CrawlDatum(CrawlDatum.STATUS_SIGNATURE, 0);
        d.setSignature(signature);
        crawlOut.append(key, d);
    }
}

This retrieves the signature computed during parsing, sets it as the signature of a CrawlDatum, and writes that CrawlDatum into crawl_parse.
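The hex round-trip exists because the signature travels through ParseData's string-valued content metadata but is stored on the CrawlDatum as raw bytes. A small self-contained sketch of that round-trip using the same StringUtil helpers (the sample bytes are made up):

import org.apache.nutch.util.StringUtil;

public class SignatureHexDemo {
    public static void main(String[] args) {
        byte[] sig = new byte[] { 0x12, (byte) 0xab, 0x00, (byte) 0xff }; // e.g. an MD5 digest
        String hex = StringUtil.toHexString(sig);        // what the mapper put into the content metadata
        byte[] restored = StringUtil.fromHexString(hex); // what ParseOutputFormat recovers here
        System.out.println(hex + " -> " + java.util.Arrays.toString(restored));
    }
}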

ParseStatus pstatus = parseData.getStatus();
if (pstatus != null && pstatus.isSuccess()
        && pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) {
    String newUrl = pstatus.getMessage();
    int refreshTime = Integer.valueOf(pstatus.getArgs()[1]);
    newUrl = normalizers.normalize(newUrl, URLNormalizers.SCOPE_FETCHER);
    newUrl = filters.filter(newUrl);
    String url = key.toString();
    if (newUrl != null && !newUrl.equals(url)) {
        String reprUrl = URLUtil.chooseRepr(url, newUrl,
                refreshTime < Fetcher.PERM_REFRESH_TIME);
        CrawlDatum newDatum = new CrawlDatum();
        newDatum.setStatus(CrawlDatum.STATUS_LINKED);
        if (reprUrl != null && !reprUrl.equals(newUrl)) {
            newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY,
                    new Text(reprUrl));
        }
        crawlOut.append(new Text(newUrl), newDatum);
    }
}

This block handles the case where parsing succeeded but the status is SUCCESS_REDIRECT, i.e. a redirect. It takes the redirected URL, normalizes and filters it, and if the result differs from the original URL, records it in crawl_parse as a STATUS_LINKED CrawlDatum, possibly carrying a representative URL in its metadata.

// collect outlinks for subsequent db update
Outlink[] links = parseData.getOutlinks();
int outlinksToStore = Math.min(maxOutlinks, links.length);
if (ignoreExternalLinks) {
    try {
        fromHost = new URL(fromUrl).getHost().toLowerCase();
    } catch (MalformedURLException e) {
        fromHost = null;
    }
} else {
    fromHost = null;
}

This collects all of the outlinks and, if external links are to be ignored, extracts the host name of the current page.
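maxOutlinks and ignoreExternalLinks come from the job configuration earlier in getRecordWriter. A sketch of those reads, assuming the usual property names db.max.outlinks.per.page and db.ignore.external.links (the defaults shown are assumptions, not copied from the source):

// Sketch only: assumed property names and default values.
int maxOutlinks = job.getInt("db.max.outlinks.per.page", 100);
boolean ignoreExternalLinks = job.getBoolean("db.ignore.external.links", false);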

int validCount = 0;
CrawlDatum adjust = null;
List<Entry<Text, CrawlDatum>> targets =
        new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
    String toUrl = links[i].getToUrl();
    // ignore links to self (or anchors within the page)
    if (fromUrl.equals(toUrl)) {
        continue;
    }
    if (ignoreExternalLinks) {
        try {
            toHost = new URL(toUrl).getHost().toLowerCase();
        } catch (MalformedURLException e) {
            toHost = null; // error handling shortened in this excerpt
        }
        if (toHost == null || !toHost.equals(fromHost)) {
            continue; // skip it
        }
    }
    try {
        toUrl = normalizers.normalize(toUrl,
                URLNormalizers.SCOPE_OUTLINK); // normalize the url
        toUrl = filters.filter(toUrl); // filter the url
    } catch (Exception e) {
        continue; // error handling elided in this excerpt
    }
    if (toUrl == null) {
        continue; // rejected by the filters
    }
    CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
    Text targetUrl = new Text(toUrl);
    try {
        scfilters.initialScore(targetUrl, target);
    } catch (Exception e) {
        // error handling elided in this excerpt
    }

    targets.add(new SimpleEntry(targetUrl, target));
    outlinkList.add(links[i]);
    validCount++;
}

Once the links are obtained, self-links are discarded, and if external links are configured to be ignored, links to other hosts are dropped as well. Each remaining URL is then normalized and filtered, and finally its initial score is computed.

try {
    // compute score contributions and adjustment to the original score
    adjust = scfilters.distributeScoreToOutlinks((Text) key,
            parseData, targets, null, links.length);
} catch (Exception e) {
    // error handling elided in this excerpt
}
for (Entry<Text, CrawlDatum> target : targets) {
    crawlOut.append(target.getKey(), target.getValue());
}
if (adjust != null)
    crawlOut.append(key, adjust);

adjust is the correction applied to the page's original score after that score has been distributed to its outlinks. All of the valid outlinks collected above are written to crawl_parse, and then the adjustment CrawlDatum (if any) is written as well.
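What distributeScoreToOutlinks actually does depends on the configured scoring filters. With the default OPIC-style scoring the idea is roughly to hand each outlink a share of the page's score and to return an adjustment for the page itself. The following is a simplified, self-contained sketch of that idea only, not the actual OPICScoringFilter code:

import java.util.List;
import org.apache.nutch.crawl.CrawlDatum;

public class ScoreDistributionSketch {
    // Give each collected outlink an equal share of the page's score and
    // return a datum that can be used to adjust the page's own score.
    // This only illustrates the idea; the real filter is more involved.
    public static CrawlDatum distribute(float pageScore,
            List<CrawlDatum> outlinkDatums, int allLinkCount) {
        if (allLinkCount == 0) return null;
        float share = pageScore / allLinkCount;
        for (CrawlDatum d : outlinkDatums) {
            d.setScore(d.getScore() + share);
        }
        CrawlDatum adjust = new CrawlDatum(CrawlDatum.STATUS_LINKED, 0);
        adjust.setScore(-pageScore); // one possible adjustment: the page "spends" what it handed out
        return adjust;
    }
}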

Outlink[] filteredLinks = outlinkList.toArray(new Outlink[outlinkList.size()]);
parseData = new ParseData(parseData.getStatus(), parseData.getTitle(),
        filteredLinks, parseData.getContentMeta(), parseData.getParseMeta());
dataOut.append(key, parseData);

if (!parse.isCanonical()) {
    CrawlDatum datum = new CrawlDatum();
    datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
    String timeString = parse.getData().getContentMeta()
            .get(Nutch.FETCH_TIME_KEY);
    try {
        datum.setFetchTime(Long.parseLong(timeString));
    } catch (Exception e) {
        // error handling elided in this excerpt
    }
    crawlOut.append(key, datum);
}

The first part writes the ParseData (rebuilt with the filtered outlinks) to parse_data; then, for non-canonical URLs, a CrawlDatum with fetch-success status and the fetch time is written to crawl_parse.
