
Koala++'s blog

Srilm: Discount Source Code Analysis

2010-06-18 17:25:43 | Category: NLP

Essentially all of the smoothing methods described in the N-Grams chapter (Chapter 4) of Speech and Language Processing, 2nd edition, are implemented in SRILM.

The first is Laplace smoothing, which does not work very well in NLP. The original text (p. 419) says: "The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros." The sharp change it refers to is Figure 4.6 on page 18, where all the values end up very close to 0. What SRILM implements is the method on page 17. Here is its definition:

/*
 * AddSmooth --
 *  Lidstone-Johnson-Jeffrey's smoothing: add a constant delta to the
 *  occurrence count of each vocabulary item.
 *
 *     p = (c + delta) / (T + N * delta)
 *
 *  where c is the item count, T is the total count, and N is the
 *  vocabulary size. This is equivalent to a discounting factor of
 *
 *     d = (1 + delta/c) / (1 + N * delta / T)
 */
class AddSmooth: public Discount
{
public:
    AddSmooth(double delta = 1.0, unsigned mincount = 0)
      : _delta(delta < 0.0 ? 0.0 : delta),
        _mincount(mincount) {};

    double discount(Count count, Count totalCount, Count observedVocab)
      { return (count <= 0) ? 1.0 : (count < _mincount) ? 0.0 :
                  (1.0 + _delta/count) /
                  (1.0 + _vocabSize*_delta/totalCount); }
    double discount(FloatCount count, FloatCount totalCount,
                                                Count observedVocab)
      { return (count <= 0.0) ? 1.0 : (count < _mincount) ? 0.0 :
                  (1.0 + _delta/count) /
                  (1.0 + _vocabSize*_delta/totalCount); }

    Boolean nodiscount() { return _mincount <= 1.0 && _delta == 0.0; };

    Boolean estimate(NgramStats &counts, unsigned order)
      { _vocabSize = vocabSize(counts.vocab); return true; }
    Boolean estimate(NgramCounts<FloatCount> &counts, unsigned order)
      { _vocabSize = vocabSize(counts.vocab); return true; }

protected:
    double _delta;            /* the additive constant */
    double _mincount;         /* minimum count to retain */
    unsigned _vocabSize;      /* vocabulary size */
};

The comment already explains things fairly clearly: Laplace smoothing adds 1 to the numerator and N to the denominator, whereas here a constant delta is added instead. The reason is that the vocabulary is very large, so adding 1 would make every probability tiny; delta is set by the user, and the default (delta = 1) gives plain Laplace smoothing. The second formula in the comment is the one the code actually computes, namely (1.0 + _delta/count) / (1.0 + _vocabSize*_delta/totalCount).
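To make the equivalence of the two formulas in the comment explicit (a short derivation added here as a reading aid, using the same symbols c, T, N, delta as the comment), multiply the discounting factor d by the maximum-likelihood estimate c/T and the add-delta probability comes back out:

$$
d \cdot \frac{c}{T}
  = \frac{1 + \delta/c}{1 + N\delta/T} \cdot \frac{c}{T}
  = \frac{(c+\delta)/c}{(T+N\delta)/T} \cdot \frac{c}{T}
  = \frac{c+\delta}{T+N\delta} = p
$$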

Next, look at the definition of GoodTuring in Discount.h under lm (the language model directory):

class GoodTuring: public Discount
{
public:
    GoodTuring(unsigned mincount = GT_defaultMinCount,
               unsigned maxcount = GT_defaultMaxCount);

    double discount(Count count, Count totalCount, Count observedVocab);
    Boolean nodiscount();

    void write(File &file);
    Boolean read(File &file);

    Boolean estimate(NgramStats &counts, unsigned order);
    Boolean estimate(NgramCounts<FloatCount> &counts, unsigned order)
      { return false; };

protected:
    Count minCount;           /* counts below this are set to 0 */
    Count maxCount;           /* counts above this are unchanged */

    Array<double> discountCoeffs;   /* cached discount coefficients */
};

Take a look at the GoodTuring constructor; the default values of its two parameters (GT_defaultMinCount and GT_defaultMaxCount) are 1 and 5:

GoodTuring::GoodTuring(unsigned mincount, unsigned maxcount)
    : minCount(mincount), maxCount(maxcount), discountCoeffs(0)
{
    /*
     * a zero count cannot be discounted
     */
    discountCoeffs[0] = 1.0;
}

"The Good-Turing intuition is to use the frequency of singletons as a re-estimate of the frequency of zero-count bigrams," so naturally an item whose count is 0 cannot be discounted any further here.

for (i = 0; i <= maxCount + 1; i++) {
    countOfCounts[i] = 0;
}

while (count = iter.next()) {
    if (counts.vocab.isNonEvent(wids[order - 1])) {
        continue;
    } else if (counts.vocab.isMetaTag(wids[order - 1])) {
        unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);

        /*
         * process count-of-count
         */
        if (type > 0 && type <= maxCount + 1) {
            countOfCounts[type] += *count;
        }
    } else if (*count <= maxCount + 1) {
        countOfCounts[*count]++;
    }
}

The index of the countOfCounts array is an occurrence count; for example, countOfCounts[3] = 4 means that 4 items occurred exactly 3 times.
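As a concrete illustration, here is a minimal standalone sketch (not SRILM code; the toy bigram counts and the maxCount value are invented) that builds the same kind of count-of-counts table from an ordinary map of counts:

// Toy count-of-counts example (hypothetical data, not from SRILM).
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Pretend these are bigram counts collected from a corpus.
    std::map<std::string, unsigned> counts = {
        {"the cat", 1}, {"the dog", 1}, {"a cat", 2},
        {"a dog", 3}, {"the end", 3}
    };
    const unsigned maxCount = 5;

    // countOfCounts[c] = number of bigrams that occurred exactly c times.
    std::vector<unsigned> countOfCounts(maxCount + 2, 0);
    for (const auto &kv : counts) {
        if (kv.second <= maxCount + 1) {
            countOfCounts[kv.second]++;
        }
    }

    for (unsigned c = 1; c <= maxCount + 1; c++) {
        std::cout << "countOfCounts[" << c << "] = "
                  << countOfCounts[c] << "\n";
    }
    return 0;
}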

double commonTerm = (maxCount + 1)
                        * (double) countOfCounts[maxCount + 1]
                        / (double) countOfCounts[1];

for (i = 1; i <= maxCount; i++) {
    double coeff;

    if (countOfCounts[i] == 0) {
        cerr << "warning: count of count " << i << " is zero\n";
        coeff = 1.0;
    } else {
        double coeff0 = (i + 1) * (double) countOfCounts[i + 1]
                            / (i * (double) countOfCounts[i]);
        coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
        if (!finite(coeff) || coeff <= Prob_Epsilon || coeff0 > 1.0) {
            cerr << "warning: discount coeff " << i
                 << " is out of range: " << coeff << "\n";
            coeff = 1.0;
        }
    }
    discountCoeffs[i] = coeff;
}

Here commonTerm is derived from formula 4.26, c* = (c+1)N_{c+1}/N_c. I suspect this term is doing something like normalization; coeff0 below is the formula that actually computes the discount. The resulting discountCoeffs are then used through the discount function.
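Written out (added here for reference; k stands for maxCount and N_c for countOfCounts[c]), the loop computes the standard Katz discount coefficient, which combines the Good-Turing re-estimate with a correction for the cutoff k:

$$
d_c = \frac{\dfrac{c^*}{c} - \dfrac{(k+1)\,N_{k+1}}{N_1}}{1 - \dfrac{(k+1)\,N_{k+1}}{N_1}},
\qquad
\frac{c^*}{c} = \frac{(c+1)\,N_{c+1}}{c\,N_c}
$$

so coeff0 is c*/c and commonTerm is (k+1)N_{k+1}/N_1.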

Next is the Kneser-Ney discounting method. The class definition is omitted; go straight to the estimate function:

while (count = iter.next()) {
    if (counts.vocab.isNonEvent(wids[order - 1])) {
        continue;
    } else if (counts.vocab.isMetaTag(wids[order - 1])) {
        unsigned type = counts.vocab.typeOfMetaTag(wids[order - 1]);

        /*
         * process count-of-count
         */
        if (type == 1) {
            n1++;
        } else if (type == 2) {
            n2++;
        }
    } else if (*count == 1) {
        n1++;
    } else if (*count == 2) {
        n2++;
    }
}

discount1 = n1 / ((double) n1 + 2 * n2);

This is a bit hard to explain with the formula in the book; see page 9 of the MIT lecture slides instead (http://people.csail.mit.edu/regina/6881/lec03/lec03.pdf). The n1 and n2 here correspond to the n1 and n2 in that formula.

The computation of discount1 differs slightly from the formula there: n2 is the number of elements with frequency 2, and it is multiplied by 2 here, presumably so that the number of occurrences is factored in as well.
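For reference, this is the estimate of the absolute discount D proposed by Ney et al., which is exactly what the discount1 line computes:

$$
D = \frac{n_1}{n_1 + 2\,n_2}
$$

where n_1 and n_2 are the numbers of n-grams seen exactly once and exactly twice.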

The statement Count*(x) = Count(x) - D appears in the code inside the discount function:

return (count - discount1) / count;
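Since, as the AddSmooth comment above indicates, the Discount classes return a factor that multiplies the maximum-likelihood estimate, this return value yields the absolute-discounted probability (a small step spelled out here, with c the n-gram count and T the total count):

$$
\frac{c - D}{c} \cdot \frac{c}{T} = \frac{c - D}{T}
$$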

 

 