Koala++'s blog

Larbin [4]: fetchOpen Source Code Analysis

2010-04-19 12:10:15 | Category: Larbin


The next function called from main is fetchDns. I split fetchDns into two parts; here is the first:

while (global::nbDnsCalls < global::dnsConn
       && global::freeConns->isNonEmpty() && global::IPUrl < maxIPUrls) { // try to avoid too many dns calls
    NamedSite *site = global::dnsSites->tryGet();
    if (site == NULL) {
        break;
    } else {
        site->newQuery();
    }
}

dnsSites yields the name of a site that still needs DNS resolution, and newQuery submits the DNS query for it:

void NamedSite::newQuery() {
    // Update our stats
    newId();
    if (global::proxyAddr != NULL) {
        // skipped
    } else if (isdigit(name[0])) {
        // the name already in numbers-and-dots notation
        siteSeen();
        if (inet_aton(name, &addr)) {
            // Yes, it is in numbers-and-dots notation
            siteDNS();
            // Get the robots.txt
            dnsOK();
        } else {
            // No, it isn't : this site is a non sense
            dnsState = errorDns;
            dnsErr();
        }
    } else {
        // submit an adns query
        global::nbDnsCalls++;
        adns_query quer = NULL;
        adns_submit(global::ads, name, (adns_rrtype) adns_r_addr,
                    (adns_queryflags) 0, this, &quer);
    }
}

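The first branch covers the case where a proxy address is set in larbin.conf; I skip it here. Next, if the name is already written in numbers-and-dots notation, it is probably an IP address and needs no resolution. Every other name requires submitting a DNS query.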

The second part of fetchDns reads back the answers that are ready:

// Read available answers
while (global::nbDnsCalls && global::freeConns->isNonEmpty()) {
    NamedSite *site;
    adns_query quer = NULL;
    adns_answer *ans;
    int res = adns_check(global::ads, &quer, &ans, (void**) &site);
    if (res == ESRCH || res == EAGAIN) {
        // No more query or no more answers
        break;
    }
    global::nbDnsCalls--;
    site->dnsAns(ans);
    free(ans); // ans has been allocated with malloc
}

This loop checks whether an answer has arrived for a pending query. When one has, it calls dnsAns:

void NamedSite::dnsAns(adns_answer *ans) {
    if (ans->status == adns_s_prohibitedcname) {
        // skipped
    } else {
        if (cname != NULL) {
            // skipped
        }
        if (ans->status != adns_s_ok) {
            // skipped
        } else {
            // compute the new addr
            memcpy(&addr, &ans->rrs.addr->addr.inet.sin_addr,
                   sizeof(struct in_addr));
            // Get the robots.txt
            dnsOK();
        }
    }
}

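I skip the case where ans->status reports a prohibited alias (CNAME). Below I again look only at the normal case, where the new address is obtained and dnsOK is called: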

void NamedSite::dnsOK() {
    Connexion *conn = global::freeConns->get();
    char res = getFds(conn, &addr, port);
    if (res != emptyC) {
        conn->timeout = timeoutPage;
        if (global::proxyAddr != NULL) {
            // use a proxy
            conn->request.addString("GET http://");
            conn->request.addString(name);
            char tmp[15];
            sprintf(tmp, ":%u", port);
            conn->request.addString(tmp);
            conn->request.addString("/robots.txt HTTP/1.0\r\nHost: ");
        } else {
            // direct connection
            conn->request.addString("GET /robots.txt HTTP/1.0\r\nHost: ");
        }
        conn->request.addString(name);
        conn->request.addString(global::headersRobots);
        conn->parser = new robots(this, conn);
        conn->pos = 0;
        conn->err = success;
        conn->state = res;
    }
}

This assembles the request that asks the site for robots.txt.

The next function called from main is fetchOpen:

void fetchOpen() {
    static time_t next_call = 0;
    if (global::now < next_call) { // too early to come back
        return;
    }
    int cont = 1;
    while (cont && global::freeConns->isNonEmpty()) {
        IPSite *s = global::okSites->tryGet();
        if (s == NULL) {
            cont = 0;
        } else {
            next_call = s->fetch();
            cont = (next_call == 0);
        }
    }
}

It takes a site from okSites and calls fetch on it:

int IPSite::fetch() {
    if (tab.isEmpty()) {
        // skipped
    } else {
        int next_call = lastAccess + global::waitDuration;
        if (next_call > global::now) {
            global::okSites->rePut(this);
            return next_call;
        } else {
            Connexion *conn = global::freeConns->get();
            url *u = getUrl();
            // We're allowed to fetch this one
            // open the socket and write the request
            char res = getFds(conn, &(u->addr), u->getPort());
            if (res != emptyC) {
                lastAccess = global::now;
                conn->timeout = timeoutPage;
                conn->request.addString("GET ");
                if (global::proxyAddr != NULL) {
                    char *tmp = u->getUrl();
                    conn->request.addString(tmp);
                } else {
                    conn->request.addString(u->getFile());
                }
                conn->request.addString(" HTTP/1.0\r\nHost: ");
                conn->request.addString(u->getHost());
                conn->request.addString(global::headers);
                conn->parser = new html(u, conn);
                conn->pos = 0;
                conn->err = success;
                conn->state = res;
                if (tab.isEmpty()) {
                    isInFifo = false;
                } else {
                    global::okSites->put(this);
                }
                return 0;
            }
        }
    }
}

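To stay polite, the crawler must not hit the same site back to back. next_call is the time of the last fetch plus the wait interval. If that time has not yet arrived, fetch puts the site back and returns the next time it may be crawled. Otherwise it fills in conn->request and the related fields, much as dnsOK does for robots.txt.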

 

 
