Koala++'s blog

Nutch 1.0 Source Code Analysis [2]: Plugin (1)

2010-03-21 12:58:47 | Category: Nutch

Let's use URLNormalizers as a way into Nutch's plugin mechanism. The configure method in the Injector class contains this statement:

    urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);

It calls this constructor:

    public URLNormalizers(Configuration conf, String scope) {
        this.conf = conf;
        this.extensionPoint = PluginRepository.get(conf)
                .getExtensionPoint(URLNormalizer.X_POINT_ID);
        ObjectCache objectCache = ObjectCache.get(conf);

        normalizers = (URLNormalizer[]) objectCache
                .getObject(URLNormalizer.X_POINT_ID + "_" + scope);
        if (normalizers == null) {
            normalizers = getURLNormalizers(scope);
        }
        if (normalizers == EMPTY_NORMALIZERS) {
            normalizers = (URLNormalizer[]) objectCache
                    .getObject(URLNormalizer.X_POINT_ID + "_" + SCOPE_DEFAULT);
            if (normalizers == null) {
                normalizers = getURLNormalizers(SCOPE_DEFAULT);
            }
        }

        loopCount = conf.getInt("urlnormalizer.loop.count", 1);
    }

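Here getExtensionPoint returns the extension point registered under the given ID; URLNormalizer.X_POINT_ID is org.apache.nutch.net.URLNormalizer. For background on extension points, see the IBM technical article 《Nutch 插件系统浅析》 (an introduction to the Nutch plugin system). The constructor then looks in the cache for the normalizers of the requested scope and calls getURLNormalizers() if nothing is cached. If the result is EMPTY_NORMALIZERS (compared by reference), no normalizer is configured for that scope, so the constructor falls back to the default scope: it checks the cache for the default-scope entry and, if that is null, calls getURLNormalizers(SCOPE_DEFAULT). loopCount is the number of passes to run when normalizing a URL. The code of getURLNormalizers is listed below.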
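The scope fallback described above can be sketched in isolation. This is a minimal illustration, not Nutch code: ScopeFallbackSketch, its plain HashMap cache, and the use of normalizer names (strings) instead of URLNormalizer instances are all assumptions made for the example. Writing the result back into the cache is also an assumption of the sketch; the constructor excerpt only shows the read.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the constructor's scope lookup; not Nutch code. */
final class ScopeFallbackSketch {
    // Shared sentinel, compared by reference like EMPTY_NORMALIZERS.
    static final String[] EMPTY = new String[0];
    static final String SCOPE_DEFAULT = "default";

    // Stand-in for ObjectCache: cache key -> resolved normalizer names.
    private final Map<String, String[]> cache = new HashMap<>();
    // Stand-in for the plugin lookup: scope -> configured normalizer names.
    private final Map<String, String[]> configured;

    ScopeFallbackSketch(Map<String, String[]> configured) {
        this.configured = configured;
    }

    // Plays the role of getURLNormalizers(scope): returns the sentinel when
    // nothing is configured. Caching the result here is this sketch's choice.
    private String[] load(String scope) {
        String[] found = configured.get(scope);
        String[] result = (found == null || found.length == 0) ? EMPTY : found;
        cache.put("x_point_" + scope, result);
        return result;
    }

    // Mirrors the constructor: try the scope, then fall back to the default scope.
    String[] resolve(String scope) {
        String[] n = cache.get("x_point_" + scope);
        if (n == null) {
            n = load(scope);
        }
        if (n == EMPTY) {
            n = cache.get("x_point_" + SCOPE_DEFAULT);
            if (n == null) {
                n = load(SCOPE_DEFAULT);
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, String[]> cfg = new HashMap<>();
        cfg.put(SCOPE_DEFAULT, new String[] {"basic", "regex"});
        ScopeFallbackSketch sketch = new ScopeFallbackSketch(cfg);
        // "inject" has no list of its own, so the default list is used.
        System.out.println(String.join(" ", sketch.resolve("inject")));
    }
}
```

The reference comparison only works because load hands back the very same EMPTY array every time. In Nutch, the per-scope loading is done by getURLNormalizers: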

    URLNormalizer[] getURLNormalizers(String scope) {
        List<Extension> extensions = getExtensions(scope);
        // Not in the excerpt as posted; added so that objectCache is declared
        // (assumed to be obtained the same way as in the constructor).
        ObjectCache objectCache = ObjectCache.get(conf);

        List<URLNormalizer> normalizers =
                new Vector<URLNormalizer>(extensions.size());

        Iterator<Extension> it = extensions.iterator();
        while (it.hasNext()) {
            Extension ext = it.next();
            URLNormalizer normalizer = null;
            try {
                // reuse a cached instance of this extension if there is one
                normalizer = (URLNormalizer) objectCache.getObject(ext.getId());
                if (normalizer == null) {
                    // go ahead and instantiate it and then cache it
                    normalizer = (URLNormalizer) ext.getExtensionInstance();
                    objectCache.setObject(ext.getId(), normalizer);
                }
                normalizers.add(normalizer);
            } catch (PluginRuntimeException e) {
                // left empty in the excerpt: an extension that fails to
                // instantiate is skipped
            }
        }
        return normalizers.toArray(new URLNormalizer[normalizers.size()]);
    }

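The method first obtains the Extensions for the given scope (never mind for now how they are found), then instantiates each Extension, reusing a cached instance when one exists, and stores the instances in normalizers. Note that for the reference comparison with EMPTY_NORMALIZERS in the constructor to ever succeed, this method would have to return that same array when no extensions are found; the excerpt here does not show such a branch. The code of getExtensions follows: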

    private List<Extension> getExtensions(String scope) {
        ObjectCache objectCache = ObjectCache.get(conf);
        List<Extension> extensions = (List<Extension>) objectCache
                .getObject(URLNormalizer.X_POINT_ID + "_x_" + scope);

        // Just compare the reference:
        // if this is the empty list, we know we will find no extension.
        if (extensions == EMPTY_EXTENSION_LIST) {
            return EMPTY_EXTENSION_LIST;
        }

        if (extensions == null) {
            extensions = findExtensions(scope);
            if (extensions != null) {
                objectCache.setObject(
                        URLNormalizer.X_POINT_ID + "_x_" + scope, extensions);
            } else {
                // Put the empty extension list into cache
                // to remember we don't know any related extension.
                objectCache.setObject(
                        URLNormalizer.X_POINT_ID + "_x_" + scope,
                        EMPTY_EXTENSION_LIST);
                extensions = EMPTY_EXTENSION_LIST;
            }
        }
        return extensions;
    }

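The pattern is the same as before: look in the cache first and, if nothing is cached, call findExtensions. If findExtensions returns a list, that list is stored in the cache; if it returns null, the shared EMPTY_EXTENSION_LIST is cached instead, so that later calls can tell by a reference comparison that there are no extensions for this scope without searching again. The code of findExtensions is listed below.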
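The idea of caching a "nothing found" result as a sentinel can be shown on its own. The sketch below is an illustration under assumptions, not the Nutch implementation: SentinelCacheSketch, the finder function, and the finderCalls counter are hypothetical names introduced here, and a HashMap stands in for ObjectCache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration of negative caching with a sentinel list. */
final class SentinelCacheSketch {
    // A unique instance, so a reference comparison is unambiguous.
    static final List<String> EMPTY_LIST =
            Collections.unmodifiableList(new ArrayList<String>());

    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> finder;
    int finderCalls = 0; // only here to show how often the finder runs

    SentinelCacheSketch(Function<String, List<String>> finder) {
        this.finder = finder;
    }

    List<String> get(String scope) {
        List<String> cached = cache.get(scope);
        // Just compare the reference: the sentinel means "searched, found nothing".
        if (cached == EMPTY_LIST) {
            return EMPTY_LIST;
        }
        if (cached == null) {
            finderCalls++;
            cached = finder.apply(scope);
            if (cached == null) {
                cached = EMPTY_LIST; // remember the miss
            }
            cache.put(scope, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        SentinelCacheSketch sketch = new SentinelCacheSketch(
                scope -> "inject".equals(scope) ? Arrays.asList("basic") : null);
        sketch.get("crawldb");
        sketch.get("crawldb"); // answered from the cached sentinel
        System.out.println("finder calls: " + sketch.finderCalls);
    }
}
```

Without the sentinel, a cached null would be indistinguishable from "not looked up yet", and every call for an unconfigured scope would repeat the search. In Nutch, the search that fills this cache is findExtensions: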

    private List<Extension> findExtensions(String scope) {
        String[] orders = null;
        String orderlist = conf.get("urlnormalizer.order." + scope);
        if (orderlist == null)
            orderlist = conf.get("urlnormalizer.order");
        if (orderlist != null && !orderlist.trim().equals("")) {
            orders = orderlist.split("\\s+");
        }
        String scopelist = conf.get("urlnormalizer.scope." + scope);
        Set<String> impls = null;
        if (scopelist != null && !scopelist.trim().equals("")) {
            String[] names = scopelist.split("\\s+");
            impls = new HashSet<String>(Arrays.asList(names));
        }
        Extension[] extensions = this.extensionPoint.getExtensions();
        HashMap<String, Extension> normalizerExtensions =
                new HashMap<String, Extension>();
        for (int i = 0; i < extensions.length; i++) {
            Extension extension = extensions[i];
            if (impls != null && !impls.contains(extension.getClazz()))
                continue;
            normalizerExtensions.put(extension.getClazz(), extension);
        }
        List<Extension> res = new ArrayList<Extension>();
        if (orders == null) {
            res.addAll(normalizerExtensions.values());
        } else {
            // first add those explicitly named in correct order
            for (int i = 0; i < orders.length; i++) {
                Extension e = normalizerExtensions.get(orders[i]);
                if (e != null) {
                    res.add(e);
                    normalizerExtensions.remove(orders[i]);
                }
            }
            // then add all others in random order
            res.addAll(normalizerExtensions.values());
        }
        return res;
    }

On scopes, the class comment says: You can define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (defined in "urlnormalizer.scope.<scope_name>" property) and its own order (defined in "urlnormalizer.order.<scope_name>" property). If any of these properties are missing, default settings are used for the global scope. In other words, you can define a narrower scope to handle the particular needs of one context. For calls made from Injector, for example, you can set urlnormalizer.order.inject (the scope name is the value of SCOPE_INJECT, "inject"). By default no such property is set, so the inject scope uses the global urlnormalizer.order, which is configured in nutch-default.xml:

    <property>
      <name>urlnormalizer.order</name>
      <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
      <description>Order in which normalizers will run. If any of these isn't
      activated it will be silently skipped. If other normalizers not on the
      list are activated, they will run in random order after the ones
      specified here are run.
      </description>
    </property>

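To summarize findExtensions: it reads the normalizer order from urlnormalizer.order.<scope> (or from urlnormalizer.order if that is not set), collects the normalizer extensions from the extension point (only those named in urlnormalizer.scope.<scope> when that property is set), and then adds the extensions to res in the configured order. Extensions that are active but not named in the order list are appended afterwards in unspecified order.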
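The ordering rule can be tried out with plain strings. This is a sketch under assumptions rather than the Nutch code: OrderSketch and its order method are hypothetical, class names stand in for Extension objects, and a LinkedHashSet replaces the HashMap so that the leftover entries come out in a predictable order (in Nutch their order is unspecified).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of "named first, the rest afterwards" ordering. */
final class OrderSketch {
    /**
     * @param available names of the active normalizers
     * @param orderList whitespace-separated names, as in urlnormalizer.order; may be null
     */
    static List<String> order(List<String> available, String orderList) {
        Set<String> remaining = new LinkedHashSet<>(available);
        List<String> result = new ArrayList<>();
        if (orderList != null && !orderList.trim().isEmpty()) {
            // first add those explicitly named, in the configured order;
            // names that are not active are silently skipped
            for (String name : orderList.trim().split("\\s+")) {
                if (remaining.remove(name)) {
                    result.add(name);
                }
            }
        }
        // then add all others
        result.addAll(remaining);
        return result;
    }

    public static void main(String[] args) {
        List<String> active = Arrays.asList("regex", "basic", "pass");
        System.out.println(order(active, "basic regex missing"));
    }
}
```

With the three active names above and the order list "basic regex missing", the result is basic, regex, pass: the unknown name is ignored and the unlisted normalizer runs last.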

The code in Injector that calls normalize is:

    url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);

The class comment of URLNormalizers says: This class uses a "chained filter" pattern to run defined normalizers. In other words, the normalizers are applied one after another, in the configured order:

    public String normalize(String urlString, String scope)
            throws MalformedURLException {
        // optionally loop several times, and break if no further changes
        String initialString = urlString;
        for (int k = 0; k < loopCount; k++) {
            for (int i = 0; i < this.normalizers.length; i++) {
                if (urlString == null)
                    return null;
                urlString = this.normalizers[i].normalize(urlString, scope);
            }
            if (initialString.equals(urlString))
                break;
            initialString = urlString;
        }
        return urlString;
    }

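Each Normalizer object in normalizers is applied to the URL in turn. The outer loop stops after loopCount passes, or earlier as soon as a pass leaves the URL unchanged from the previous pass. If any normalizer returns null, the remaining normalizers are skipped and null is returned.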
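Why several passes can matter is easiest to see with a toy chain. The following is a minimal sketch, assuming simple string functions in place of URLNormalizer plugins; ChainSketch and the two example steps are hypothetical, and the null handling is slightly more defensive than in the Nutch listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical illustration of a chained filter with a bounded number of passes. */
final class ChainSketch {
    static String normalize(String url, List<UnaryOperator<String>> chain, int loopCount) {
        if (url == null) {
            return null;
        }
        String previous = url;
        for (int k = 0; k < loopCount; k++) {
            for (UnaryOperator<String> step : chain) {
                if (url == null) {
                    return null; // a step rejected the URL
                }
                url = step.apply(url);
            }
            // stop early once a full pass changes nothing
            if (url == null || url.equals(previous)) {
                break;
            }
            previous = url;
        }
        return url;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(s -> s.toLowerCase(Locale.ROOT));
        // removes only ONE "/./" segment per pass, so repeated passes help
        chain.add(s -> s.replaceFirst("/\\./", "/"));

        String url = "http://Example.com/a/././b";
        System.out.println(normalize(url, chain, 1));
        System.out.println(normalize(url, chain, 5));
    }
}
```

With one pass the example URL still contains a "/./" segment (http://example.com/a/./b); with up to five passes it reaches http://example.com/a/b and the loop stops on the third pass because nothing changes any more. This mirrors the role of urlnormalizer.loop.count, whose default of 1 means a single pass.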

 

 

 

 

 

 
