
Koala++'s blog

Nutch 1.0 Source Code Analysis [13]: ScoringFilters

2010-03-25 15:54:25 | Category: Nutch


The theoretical background of Nutch's scoring can be found in the article "Fixing the OPIC algorithm in Nutch".

Now to the code. ScoringFilters first appears in the map function of Injector's inner class InjectMapper:

try {
    scfilters.injectedScore(value, datum);
} catch (ScoringFilterException e) {
    datum.setScore(scoreInjected);
}

injectedScore computes an initial score when a new page is injected:

/** Calculate a new initial score, used when injecting new pages. */
public void injectedScore(Text url, CrawlDatum datum)
        throws ScoringFilterException {
    for (int i = 0; i < this.filters.length; i++) {
        this.filters[i].injectedScore(url, datum);
    }
}

By default the only filter is OPICScoringFilter, and its injectedScore is very simple:

/** Set to the value defined in config, 1.0f by default. */
public void injectedScore(Text url, CrawlDatum datum)
        throws ScoringFilterException {
    datum.setScore(scoreInjected);
}

By default, a newly injected page's score is set to 1.0.
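The injected value is configurable. As far as I recall, OPICScoringFilter reads scoreInjected from the db.score.injected property (treat the property name and default as assumptions to check against your Nutch version), so an entry like the following in nutch-site.xml would raise the starting score of seeded pages:

```xml
<!-- assumed property: db.score.injected (default 1.0) -->
<property>
  <name>db.score.injected</name>
  <value>2.0</value>
  <description>Initial score assigned to newly injected pages.</description>
</property>
```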

The second appearance is in the map function of Generator's inner class Selector:

float sort = 1.0f;
try {
    sort = scfilters.generatorSortValue((Text) key, crawlDatum, sort);
} catch (ScoringFilterException sfe) {
}
// sort by decreasing score, using DecreasingFloatComparator
sortValue.set(sort);
// record generation time
crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY, genTime);
entry.datum = crawlDatum;
entry.url = (Text) key;
output.collect(sortValue, entry); // invert for sort by score

generatorSortValue is a score used for sorting; Selector finally outputs entries in decreasing order of this score.

/** Calculate a sort value for Generate. */
public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
        throws ScoringFilterException {
    for (int i = 0; i < this.filters.length; i++) {
        initSort = this.filters[i].generatorSortValue(url, datum, initSort);
    }
    return initSort;
}

Looking directly at the OPICScoringFilter implementation:

public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
        throws ScoringFilterException {
    return datum.getScore() * initSort;
}

Since initSort is 1.0, it has no effect here; this is equivalent to simply returning datum.getScore().
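To make the ordering concrete, here is a minimal, self-contained sketch (not the Nutch source): with initSort = 1.0 the sort key is just the page's OPIC score, and mimicking DecreasingFloatComparator sorts the keys highest-first.

```java
import java.util.Arrays;

// Illustrative sketch (not Nutch code): the generator's sort value and
// the decreasing-score ordering Selector uses when emitting entries.
public class GeneratorSortSketch {
    // with initSort = 1.0 this is just the page score
    static float generatorSortValue(float pageScore, float initSort) {
        return pageScore * initSort;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 2.0f, 1.0f};
        Float[] sortValues = new Float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            sortValues[i] = generatorSortValue(scores[i], 1.0f);
        }
        // mimic DecreasingFloatComparator: highest score first
        Arrays.sort(sortValues, (a, b) -> Float.compare(b, a));
        System.out.println(Arrays.toString(sortValues)); // [2.0, 1.0, 0.5]
    }
}
```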

The third appearance is in the output function of Fetcher's inner class FetcherThread:

// add score to content metadata so that ParseSegment can pick it up.
try {
    scfilters.passScoreBeforeParsing(key, datum, content);
} catch (Exception e) {
}
...
try {
    scfilters.passScoreAfterParsing(url, content, parse);
} catch (Exception e) {
}

As the comment notes, the score is put into the content metadata so that ParseSegment can pick it up. The OPICScoringFilter implementations are:

/** Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY. */
public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) {
    content.getMetadata().set(Nutch.SCORE_KEY, "" + datum.getScore());
}

/**
 * Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
 */
public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    parse.getData().getContentMeta().set(Nutch.SCORE_KEY,
            content.getMetadata().get(Nutch.SCORE_KEY));
}

The score is copied into the content's metadata, and after the page is parsed it is copied again into the contentMeta of the parse data.
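The hand-off can be sketched with plain maps standing in for the metadata objects (not the Nutch source; the key name used here is an assumption, as Nutch.SCORE_KEY's actual string value may differ):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Nutch code): the score survives fetching and
// parsing by being written into the content metadata, then copied into
// the parse metadata under the same key.
public class ScorePassSketch {
    static final String SCORE_KEY = "nutch.crawl.score"; // assumed key name

    // before parsing: store the CrawlDatum score as a string
    static void passScoreBeforeParsing(float score, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, "" + score);
    }

    // after parsing: copy the stored value into the parse metadata
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        parseMeta.put(SCORE_KEY, contentMeta.get(SCORE_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(1.0f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta.get(SCORE_KEY)); // prints 1.0
    }
}
```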

The fourth appearance is in ParseOutputFormat's getRecordWriter:

CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
Text targetUrl = new Text(toUrl);
try {
    scfilters.initialScore(targetUrl, target);
} catch (ScoringFilterException e) {
    target.setScore(0.0f);
}

targets.add(new SimpleEntry(targetUrl, target));
outlinkList.add(links[i]);
validCount++;
} // end of the loop over outlinks
try {
    // compute score contributions and adjustment to the original score
    adjust = scfilters.distributeScoreToOutlinks((Text) key,
            parseData, targets, null, links.length);
} catch (ScoringFilterException e) {
}

OPICScoringFilter's implementation of initialScore is:

/**
 * Set to 0.0f (unknown value) - inlink contributions will bring it to a
 * correct level. Newly discovered pages have at least one inlink.
 */
public void initialScore(Text url, CrawlDatum datum)
        throws ScoringFilterException {
    datum.setScore(0.0f);
}

The score is set to 0 at initialization; as the comment explains, a newly discovered page is pointed to by at least one other page, so the inlink contributions will bring the score to a reasonable level.

distributeScoreToOutlinks can be read in two parts:

float score = scoreInjected;
String scoreString = parseData.getContentMeta().get(Nutch.SCORE_KEY);
if (scoreString != null) {
    try {
        score = Float.parseFloat(scoreString);
    } catch (Exception e) {
        e.printStackTrace(LogUtil.getWarnStream(LOG));
    }
}
int validCount = targets.size();
if (countFiltered) {
    score /= allCount;
} else {
    if (validCount == 0) {
        // no outlinks to distribute score, so just return adjust
        return adjust;
    }
    score /= validCount;
}

If the source page carries no score in its metadata, the default scoreInjected is used; otherwise the stored value is parsed. Then the number of outlinks on the page is counted, and the score given to each outlink is the page's score divided by the number of links.
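The division step is simple arithmetic; as a minimal sketch (not the Nutch source), a page with score 1.0 and four valid outlinks passes 0.25 to each:

```java
// Illustrative sketch (not Nutch code): OPIC splits a page's score evenly
// among its outlinks; with countFiltered == false only valid outlinks count.
public class DistributeSketch {
    static float perOutlinkScore(float pageScore, int validCount) {
        return pageScore / validCount;
    }

    public static void main(String[] args) {
        // a page with score 1.0 and 4 valid outlinks passes 0.25 to each
        System.out.println(perOutlinkScore(1.0f, 4)); // 0.25
    }
}
```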

// internal and external score factor
float internalScore = score * internalScoreFactor;
float externalScore = score * externalScoreFactor;
for (Entry<Text, CrawlDatum> target : targets) {
    try {
        String toHost = new URL(target.getKey().toString()).getHost();
        String fromHost = new URL(fromUrl.toString()).getHost();
        if (toHost.equalsIgnoreCase(fromHost)) {
            target.getValue().setScore(internalScore);
        } else {
            target.getValue().setScore(externalScore);
        }
    } catch (MalformedURLException e) {
        e.printStackTrace(LogUtil.getWarnStream(LOG));
        target.getValue().setScore(externalScore);
    }
}

By default both internalScoreFactor and externalScoreFactor are 1.0; they are the factors applied to outlinks on the same host as the source page and on a different host, respectively.
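The internal/external decision boils down to a case-insensitive host comparison, which can be reproduced with java.net.URL alone (an illustrative sketch, not the Nutch source):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative sketch (not Nutch code): an outlink is "internal" when its
// host matches the source page's host case-insensitively, as in
// distributeScoreToOutlinks.
public class HostCheckSketch {
    static boolean isInternal(String fromUrl, String toUrl)
            throws MalformedURLException {
        String fromHost = new URL(fromUrl).getHost();
        String toHost = new URL(toUrl).getHost();
        return toHost.equalsIgnoreCase(fromHost);
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(isInternal("http://example.com/a",
                                      "http://EXAMPLE.com/b")); // true
        System.out.println(isInternal("http://example.com/a",
                                      "http://other.org/c"));   // false
    }
}
```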

The fifth appearance is in CrawlDbReducer:

try {
    scfilters.updateDbScore((Text) key, oldSet ? old : null, result, linked);
} catch (Exception e) {
}

oldSet indicates whether the URL has been seen before. The updateDbScore implementation is:

/** Increase the score by a sum of inlinked scores. */
public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
        List inlinked) throws ScoringFilterException {
    float adjust = 0.0f;
    for (int i = 0; i < inlinked.size(); i++) {
        CrawlDatum linked = (CrawlDatum) inlinked.get(i);
        adjust += linked.getScore();
    }
    if (old == null)
        old = datum;
    datum.setScore(old.getScore() + adjust);
}

Here the scores of all inlinks pointing to the page are summed, and the sum is added to the page's original score.
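The update step above can be sketched with plain floats (not the Nutch source): the new score is the old score plus the sum of the inlink contributions.

```java
// Illustrative sketch (not Nutch code): updateDbScore adds the sum of all
// inlink contributions to the page's previous score.
public class UpdateScoreSketch {
    static float updateDbScore(float oldScore, float[] inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    public static void main(String[] args) {
        // a page with score 1.0 receiving 0.25 from each of three inlinks
        System.out.println(updateDbScore(1.0f, new float[] {0.25f, 0.25f, 0.25f})); // 1.75
    }
}
```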

The sixth appearance is in IndexerMapReduce's reduce function:

float boost = 1.0f;
// run scoring filters
try {
    boost = this.scfilters.indexerScore(key, doc, dbDatum, fetchDatum,
            parse, inlinks, boost);
} catch (final ScoringFilterException e) {
    return;
}
// apply boost to all indexed fields.
doc.setScore(boost);
// store boost for use by explain and dedup
doc.add("boost", Float.toString(boost));

The OPICScoringFilter implementation is:

/** Dampen the boost value by scorePower. */
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
        CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
        throws ScoringFilterException {
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
}

Here scorePower is used to dampen the boost value; scorePower defaults to 0.5.
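With the default scorePower of 0.5, the damping is just a square root, which compresses large CrawlDb scores into more moderate index boosts. A minimal sketch (not the Nutch source):

```java
// Illustrative sketch (not Nutch code): the index boost is the CrawlDb
// score raised to scorePower (default 0.5), i.e. a square root by default.
public class IndexerBoostSketch {
    static float indexerScore(float dbScore, float scorePower, float initScore) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        // a page with CrawlDb score 4.0 is boosted by sqrt(4.0) = 2.0
        System.out.println(indexerScore(4.0f, 0.5f, 1.0f)); // 2.0
    }
}
```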
