
Koala++'s blog


Weka Development [45]: BayesNet Source Code Analysis

2010-05-27 15:10:09 | Category: Machine Learning

Bayesian networks are a fairly difficult algorithm; the paper Bayesian Network Classifiers in Weka is worth reading alongside this code.

The important lines in the buildClassifier function are:

// build the network structure
initStructure();

// build the network structure
buildStructure();

// build the set of CPTs
estimateCPTs();

The initStructure function initializes the network structure, buildStructure constructs the network structure, and estimateCPTs computes the conditional probability tables (CPTs).

/**
 * Init structure initializes the structure to an empty graph
 * or a Naive Bayes graph (depending on the -N flag).
 */
public void initStructure() throws Exception {
    // reserve memory
    m_ParentSets = new ParentSet[m_Instances.numAttributes()];

    for (int iAttribute = 0; iAttribute < m_Instances.numAttributes(); iAttribute++) {
        m_ParentSets[iAttribute] = new ParentSet(m_Instances.numAttributes());
    }
} // initStructure

m_ParentSets records the parent nodes of the i-th attribute (iAttribute). The ParentSet constructor is:

public ParentSet(int nMaxNrOfParents) {
    m_nParents = new int[nMaxNrOfParents];
    m_nNrOfParents = 0;
    m_nCardinalityOfParents = 1;
} // ParentSet

It does nothing special; the result is an empty graph.

Next look at buildStructure, which calls buildStructure in SearchAlgorithm:

/**
 * buildStructure determines the network structure/graph of the network.
 * The default behavior is creating a network where all nodes have
 * the first node as its parent (i.e., a BayesNet that behaves like
 * a naive Bayes classifier). This method can be overridden by derived
 * classes to restrict the class of network structures that are acceptable.
 */
public void buildStructure(BayesNet bayesNet, Instances instances) throws Exception {
    if (m_bInitAsNaiveBayes) {
        int iClass = instances.classIndex();
        // initialize parent sets to have arrow from classifier node to
        // each of the other nodes
        for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
            if (iAttribute != iClass) {
                bayesNet.getParentSet(iAttribute).addParent(iClass, instances);
            }
        }
    }
    search(bayesNet, instances);
    if (m_bMarkovBlanketClassifier) {
        doMarkovBlanketCorrection(bayesNet, instances);
    }
} // buildStructure

This checks whether to initialize as naive Bayes. If not, the graph stays empty; if so, the class attribute is added as a parent of every other attribute. The addParent code is:

public void addParent(int nParent, Instances _Instances) {
    if (m_nNrOfParents == 10) {
        // reserve more memory
        int[] nParents = new int[50];
        for (int i = 0; i < m_nNrOfParents; i++) {
            nParents[i] = m_nParents[i];
        }
        m_nParents = nParents;
    }
    m_nParents[m_nNrOfParents] = nParent;
    m_nNrOfParents++;
    m_nCardinalityOfParents *= _Instances.attribute(nParent).numValues();
} // AddParent

The leading if reserves more memory when needed; the rest records which attribute becomes a parent. m_nNrOfParents is the number of parents, and m_nCardinalityOfParents is the product of the numbers of values the parents can take, i.e. the number of joint parent configurations.
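The cardinality bookkeeping can be illustrated with a small standalone sketch (a hypothetical class, not the actual Weka ParentSet): each added parent multiplies in its number of values.

```java
// Hypothetical sketch of ParentSet-style bookkeeping (not the Weka class).
public class ParentSetSketch {
    private final int[] parents;          // indices of parent attributes
    private int nrOfParents = 0;
    private int cardinalityOfParents = 1; // number of joint parent configurations

    public ParentSetSketch(int maxNrOfParents) {
        parents = new int[maxNrOfParents];
    }

    // arities[a] = number of discrete values attribute a can take
    public void addParent(int parent, int[] arities) {
        parents[nrOfParents++] = parent;
        cardinalityOfParents *= arities[parent]; // product, not sum
    }

    public int getCardinalityOfParents() {
        return cardinalityOfParents;
    }
}
```

With parents of arity 3 and 4, the CPT of the child needs 3 * 4 = 12 rows, one per parent configuration.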

There are many implementations of the search function; here is the K2 implementation:

int nOrder[] = new int[instances.numAttributes()];
nOrder[0] = instances.classIndex();

int nAttribute = 0;

for (int iOrder = 1; iOrder < instances.numAttributes(); iOrder++) {
    if (nAttribute == instances.classIndex()) {
        nAttribute++;
    }
    nOrder[iOrder] = nAttribute++;
}

In nOrder the class attribute gets index 0; the remaining attributes keep their original order.
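The ordering construction can be checked with a minimal standalone sketch (hypothetical helper, mirroring the loop above):

```java
// Hypothetical sketch: class index first, remaining attributes in original order.
public class OrderSketch {
    public static int[] buildOrder(int numAttributes, int classIndex) {
        int[] order = new int[numAttributes];
        order[0] = classIndex;
        int attr = 0;
        for (int i = 1; i < numAttributes; i++) {
            if (attr == classIndex) {
                attr++; // skip the class attribute, it is already at index 0
            }
            order[i] = attr++;
        }
        return order;
    }
}
```

For 4 attributes with class index 2, the resulting order is {2, 0, 1, 3}.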

// determine base scores
double[] fBaseScores = new double[instances.numAttributes()];
for (int iOrder = 0; iOrder < instances.numAttributes(); iOrder++) {
    int iAttribute = nOrder[iOrder];
    fBaseScores[iAttribute] = calcNodeScore(iAttribute);
}

Computing the base scores calls the calcNodeScore function:

public double calcNodeScore(int nNode) {
    if (m_BayesNet.getUseADTree() && m_BayesNet.getADTree() != null) {
        return calcNodeScoreADTree(nNode);
    } else {
        return calcNodeScorePlain(nNode);
    }
}

Ignore the ADTree case for now and look at calcNodeScorePlain:

// estimate distributions
Enumeration enumInsts = instances.enumerateInstances();

while (enumInsts.hasMoreElements()) {
    Instance instance = (Instance) enumInsts.nextElement();

    // updateClassifier;
    double iCPT = 0;

    for (int iParent = 0; iParent < oParentSet.getNrOfParents(); iParent++) {
        int nParent = oParentSet.getParent(iParent);

        iCPT = iCPT * instances.attribute(nParent).numValues()
                + instance.value(nParent);
    }

    nCounts[numValues * ((int) iCPT) + (int) instance.value(nNode)]++;
}

nCounts here is the N_ijk mentioned on page 4 of Bayesian Network Classifiers in Weka; the three dimensions i, j, k are flattened into a single array, with the value of the node itself, instance.value(nNode), as the last index.
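The mixed-radix flattening of a parent configuration into iCPT can be isolated as a small sketch (hypothetical helper, not part of Weka):

```java
// Hypothetical sketch of the mixed-radix index used for nCounts.
public class CptIndex {
    // parentValues[p] is the observed value of parent p,
    // parentArities[p] its number of possible values.
    public static int flatten(int[] parentValues, int[] parentArities) {
        int iCPT = 0;
        for (int p = 0; p < parentValues.length; p++) {
            iCPT = iCPT * parentArities[p] + parentValues[p];
        }
        return iCPT;
    }
}
```

With two parents of arities 2 and 3, configurations map to indices 0..5; for example values (1, 2) map to 1 * 3 + 2 = 5.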

At the end, calcNodeScorePlain calls the calcScoreOfCount function:

for (int iParent = 0; iParent < nCardinality; iParent++) {
    switch (m_nScoreType) {

    case (Scoreable.BAYES): {
        double nSumOfCounts = 0;

        for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
            if (m_fAlpha + nCounts[iParent * numValues + iSymbol] != 0) {
                fLogScore += Statistics.lnGamma(m_fAlpha
                        + nCounts[iParent * numValues + iSymbol]);
                nSumOfCounts += m_fAlpha
                        + nCounts[iParent * numValues + iSymbol];
            }
        }

        if (nSumOfCounts != 0) {
            fLogScore -= Statistics.lnGamma(nSumOfCounts);
        }

        if (m_fAlpha != 0) {
            fLogScore -= numValues * Statistics.lnGamma(m_fAlpha);
            fLogScore += Statistics.lnGamma(numValues * m_fAlpha);
        }
    }

See the Bayesian metric formula on page 6 of Bayesian Network Classifiers in Weka. The first for loop accumulates the Gamma(N'_ijk + N_ijk) terms, the subtraction of lnGamma(nSumOfCounts) corresponds to Gamma(N'_ij + N_ij), and the final if handles the prior terms Gamma(N'_ij) / prod_k Gamma(N'_ijk), where m_fAlpha plays the role of N'_ijk.

case (Scoreable.MDL):
case (Scoreable.AIC):
case (Scoreable.ENTROPY): {
    double nSumOfCounts = 0;

    for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
        nSumOfCounts += nCounts[iParent * numValues + iSymbol];
    }

    for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
        if (nCounts[iParent * numValues + iSymbol] > 0) {
            fLogScore += nCounts[iParent * numValues + iSymbol]
                    * Math.log(nCounts[iParent * numValues + iSymbol]
                            / nSumOfCounts);
        }
    }
}

This corresponds to equation (2) on page 5 of Bayesian Network Classifiers in Weka; the difference is that N is absent, because it cancels out.
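The per-configuration entropy term can be reproduced with a tiny standalone sketch (hypothetical helper; the counts for one parent configuration are passed in directly):

```java
// Hypothetical sketch of the entropy contribution for one parent configuration:
// sum over symbols of N_ijk * log(N_ijk / N_ij), matching the loop above.
public class EntropyScoreSketch {
    public static double logScore(double[] counts) {
        double sum = 0; // N_ij, the total count for this configuration
        for (double c : counts) {
            sum += c;
        }
        double fLogScore = 0;
        for (double c : counts) {
            if (c > 0) {
                fLogScore += c * Math.log(c / sum);
            }
        }
        return fLogScore;
    }
}
```

For a uniform count vector like {2, 2}, the term is 4 * ln(0.5), i.e. maximally negative for that total, which is why more predictive parent sets score higher.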

switch (m_nScoreType) {
case (Scoreable.MDL): {
    fLogScore -= 0.5 * nCardinality * (numValues - 1)
            * Math.log(instances.numInstances());
}
    break;

case (Scoreable.AIC): {
    fLogScore -= nCardinality * (numValues - 1);
}
    break;
}

In the formulas, K = nCardinality * (numValues - 1) and N = instances.numInstances(); see equation (3). The MDL and AIC computations follow equation (5).
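For reference, the two penalty terms can be written as a minimal sketch (hypothetical helper names):

```java
// Hypothetical sketch of the MDL and AIC complexity penalties,
// with K = nCardinality * (numValues - 1) free parameters per node.
public class PenaltySketch {
    public static double mdlPenalty(int nCardinality, int numValues, int numInstances) {
        // 0.5 * K * log N
        return 0.5 * nCardinality * (numValues - 1) * Math.log(numInstances);
    }

    public static double aicPenalty(int nCardinality, int numValues) {
        // just K
        return nCardinality * (numValues - 1);
    }
}
```

The MDL penalty grows with log N, so with more data the structure search can afford more parameters before the penalty dominates.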

// K2 algorithm: greedy search restricted by ordering
for (int iOrder = 1; iOrder < instances.numAttributes(); iOrder++) {
    int iAttribute = nOrder[iOrder];
    double fBestScore = fBaseScores[iAttribute];

    boolean bProgress = (bayesNet.getParentSet(iAttribute)
            .getNrOfParents() < getMaxNrOfParents());
    while (bProgress) {
        int nBestAttribute = -1;
        for (int iOrder2 = 0; iOrder2 < iOrder; iOrder2++) {
            int iAttribute2 = nOrder[iOrder2];
            double fScore = calcScoreWithExtraParent(iAttribute, iAttribute2);
            if (fScore > fBestScore) {
                fBestScore = fScore;
                nBestAttribute = iAttribute2;
            }
        }
        if (nBestAttribute != -1) {
            bayesNet.getParentSet(iAttribute).addParent(nBestAttribute, instances);
            fBaseScores[iAttribute] = fBestScore;
            bProgress = (bayesNet.getParentSet(iAttribute)
                    .getNrOfParents() < getMaxNrOfParents());
        } else {
            bProgress = false;
        }
    }
}

bProgress checks whether node iAttribute has reached the maximum number of parents. The logic is roughly: for each attribute (iOrder), candidate parents are sought only among positions [0, iOrder), otherwise a cycle could form. For each candidate, calcScoreWithExtraParent computes a score; if it beats the current best score, that candidate becomes the best attribute, and addParent adds it.
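The whole K2 loop can be summarized as a self-contained sketch (hypothetical; the score function is a stand-in for calcScoreWithExtraParent / calcNodeScore, and higher means better):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch of the K2 greedy search restricted by an ordering.
public class K2Sketch {
    // score.apply(node, parents) returns the score of `node` with that parent set.
    public static List<List<Integer>> search(int nAttributes, int maxParents,
            BiFunction<Integer, List<Integer>, Double> score) {
        List<List<Integer>> parentSets = new ArrayList<>();
        for (int i = 0; i < nAttributes; i++) {
            parentSets.add(new ArrayList<>());
        }
        // candidates come only from earlier nodes, which keeps the graph acyclic
        for (int iOrder = 1; iOrder < nAttributes; iOrder++) {
            List<Integer> parents = parentSets.get(iOrder);
            double best = score.apply(iOrder, parents);
            boolean progress = parents.size() < maxParents;
            while (progress) {
                int bestCandidate = -1;
                for (int cand = 0; cand < iOrder; cand++) {
                    if (parents.contains(cand)) continue;
                    parents.add(cand);                  // tentatively add
                    double s = score.apply(iOrder, parents);
                    parents.remove(parents.size() - 1); // revert
                    if (s > best) {
                        best = s;
                        bestCandidate = cand;
                    }
                }
                if (bestCandidate >= 0) {
                    parents.add(bestCandidate);
                    progress = parents.size() < maxParents;
                } else {
                    progress = false; // no candidate improves the score
                }
            }
        }
        return parentSets;
    }
}
```

The tentative add / score / revert step in the inner loop is exactly the pattern calcScoreWithExtraParent implements in Weka.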

public double calcScoreWithExtraParent(int nNode, int nCandidateParent) {
    ParentSet oParentSet = m_BayesNet.getParentSet(nNode);

    // sanity check: nCandidateParent should not be in parent set already
    if (oParentSet.contains(nCandidateParent)) {
        return -1e100;
    }

    // set up candidate parent
    oParentSet.addParent(nCandidateParent, m_BayesNet.m_Instances);

    // calculate the score
    double logScore = calcNodeScore(nNode);

    // delete temporarily added parent
    oParentSet.deleteLastParent(m_BayesNet.m_Instances);

    return logScore;
} // CalcScoreWithExtraParent

The function name says it all: the candidate node (nCandidateParent) is first added to the parent set (oParentSet), then calcNodeScore (seen above) computes the score, and finally the just-added parent is removed, restoring the original parent set.

Now look at the estimateCPTs function, using the default SimpleEstimator:

/**
 * estimateCPTs estimates the conditional probability tables for the Bayes
 * Net using the network structure.
 */
public void estimateCPTs(BayesNet bayesNet) throws Exception {
    initCPTs(bayesNet);

    // Compute counts
    Enumeration enumInsts = bayesNet.m_Instances.enumerateInstances();
    while (enumInsts.hasMoreElements()) {
        Instance instance = (Instance) enumInsts.nextElement();

        updateClassifier(bayesNet, instance);
    }
} // estimateCPTs

The initCPTs function initializes the CPTs, which are then learned incrementally:

/**
 * initCPTs reserves space for CPTs and set all counts to zero
 */
public void initCPTs(BayesNet bayesNet) throws Exception {
    Instances instances = bayesNet.m_Instances;

    // Reserve space for CPTs
    int nMaxParentCardinality = 1;
    for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
        if (bayesNet.getParentSet(iAttribute).getCardinalityOfParents()
                > nMaxParentCardinality) {
            nMaxParentCardinality = bayesNet.getParentSet(iAttribute)
                    .getCardinalityOfParents();
        }
    }

    // Reserve plenty of memory
    bayesNet.m_Distributions =
            new Estimator[instances.numAttributes()][nMaxParentCardinality];

    // estimate CPTs
    for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
        for (int iParent = 0; iParent < bayesNet.getParentSet(iAttribute)
                .getCardinalityOfParents(); iParent++) {
            bayesNet.m_Distributions[iAttribute][iParent] =
                    new DiscreteEstimatorBayes(
                            instances.attribute(iAttribute).numValues(), m_fAlpha);
        }
    }
} // initCPTs

The first for loop computes nMaxParentCardinality, i.e. the size of the space to allocate for m_Distributions; the following loop then initializes an estimator for each parent configuration of each attribute node.

public void updateClassifier(BayesNet bayesNet, Instance instance)
        throws Exception {
    for (int iAttribute = 0; iAttribute < bayesNet.m_Instances
            .numAttributes(); iAttribute++) {
        double iCPT = 0;

        for (int iParent = 0; iParent < bayesNet.getParentSet(iAttribute)
                .getNrOfParents(); iParent++) {
            int nParent = bayesNet.getParentSet(iAttribute).getParent(iParent);

            iCPT = iCPT * bayesNet.m_Instances.attribute(nParent).numValues()
                    + instance.value(nParent);
        }

        bayesNet.m_Distributions[iAttribute][(int) iCPT]
                .addValue(instance.value(iAttribute), instance.weight());
    }
} // updateClassifier

       calcNodeScorePlain的计数方式相似,只是这里用的是二维数组,接下来看DiscreteEstimate的初始化函数和addValue函数。

public DiscreteEstimatorBayes(int nSymbols, double fPrior) {
    m_fPrior = fPrior;
    m_nSymbols = nSymbols;
    m_Counts = new double[m_nSymbols];

    for (int iSymbol = 0; iSymbol < m_nSymbols; iSymbol++) {
        m_Counts[iSymbol] = m_fPrior;
    }

    m_SumOfCounts = m_fPrior * (double) m_nSymbols;
} // DiscreteEstimatorBayes

Nothing special: it initializes m_Counts and m_SumOfCounts.

public void addValue(double data, double weight) {
    m_Counts[(int) data] += weight;
    m_SumOfCounts += weight;
}

Very simple: the instance weight is added to the corresponding count.
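The whole estimator can be mimicked with a minimal standalone sketch (hypothetical class mirroring DiscreteEstimatorBayes, with the prior acting as Laplace-style smoothing):

```java
// Hypothetical sketch of a prior-smoothed discrete estimator,
// mirroring DiscreteEstimatorBayes: every symbol starts with `prior` counts.
public class SmoothedCounter {
    private final double[] counts;
    private double sumOfCounts;

    public SmoothedCounter(int nSymbols, double prior) {
        counts = new double[nSymbols];
        for (int i = 0; i < nSymbols; i++) {
            counts[i] = prior;
        }
        sumOfCounts = prior * nSymbols;
    }

    public void addValue(int symbol, double weight) {
        counts[symbol] += weight;
        sumOfCounts += weight;
    }

    public double getProbability(int symbol) {
        return sumOfCounts == 0 ? 0 : counts[symbol] / sumOfCounts;
    }
}
```

The prior keeps probabilities away from exact zero for unseen symbols, which matters later when logs of these probabilities are summed.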

Now look at the following code in SimpleEstimator:

for (int iClass = 0; iClass < nNumClasses; iClass++) {
    fProbs[iClass] = 1.0;
}

I have a question about this code: judging from the comments around it, probabilities apparently used to be computed by direct multiplication and were later changed to summing logs and then exponentiating. Initializing to 1 was fine for multiplication, but with log sums the initialization should be 0. (In practice the constant offset is harmless, since it cancels when fMax is subtracted below.)

for (int iClass = 0; iClass < nNumClasses; iClass++) {
    double logfP = 0;

    for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
        double iCPT = 0;

        for (int iParent = 0; iParent < bayesNet.getParentSet(iAttribute)
                .getNrOfParents(); iParent++) {
            int nParent = bayesNet.getParentSet(iAttribute).getParent(iParent);

            if (nParent == instances.classIndex()) {
                iCPT = iCPT * nNumClasses + iClass;
            } else {
                iCPT = iCPT * instances.attribute(nParent).numValues()
                        + instance.value(nParent);
            }
        }

        if (iAttribute == instances.classIndex()) {
            logfP += Math.log(bayesNet.m_Distributions[iAttribute][(int) iCPT]
                    .getProbability(iClass));
        } else {
            logfP += Math.log(bayesNet.m_Distributions[iAttribute][(int) iCPT]
                    .getProbability(instance.value(iAttribute)));
        }
    }

    fProbs[iClass] += logfP;
}

The innermost loop computes iCPT, i.e. the position in the second dimension of m_Distributions; compare equation (1) on page 3. getProbability in DiscreteEstimatorBayes is:

public double getProbability(double data) {
    if (m_SumOfCounts == 0) {
        // this can only happen if numSymbols = 0 in constructor
        return 0;
    }

    return (double) m_Counts[(int) data] / m_SumOfCounts;
}

This is just the conditional probability. Note that logs are being summed, hence logfP +=.

// Find maximum
double fMax = fProbs[0];
for (int iClass = 0; iClass < nNumClasses; iClass++) {
    if (fProbs[iClass] > fMax) {
        fMax = fProbs[iClass];
    }
}
// transform from log-space to normal-space
for (int iClass = 0; iClass < nNumClasses; iClass++) {
    fProbs[iClass] = Math.exp(fProbs[iClass] - fMax);
}

// Display probabilities
Utils.normalize(fProbs);

return fProbs;

It finds the maximum of fProbs, converts from log space to normal space by subtracting that maximum before exponentiating, and then calls normalize so the probabilities sum to 1.
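The max-subtraction trick can be sketched on its own (hypothetical helper): subtracting the maximum log-probability before exponentiating avoids underflow when the logs are very negative, and the shift cancels in the normalization.

```java
// Hypothetical sketch of the log-space normalization used above.
public class LogNormalize {
    public static double[] normalizeLogProbs(double[] logP) {
        double max = logP[0];
        for (double v : logP) {
            if (v > max) {
                max = v;
            }
        }
        double[] p = new double[logP.length];
        double sum = 0;
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.exp(logP[i] - max); // safe: exponent is <= 0
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum; // probabilities now sum to 1
        }
        return p;
    }
}
```

Without the shift, Math.exp(-1000) would underflow to 0 and every class would get probability 0; with it, the relative ratios survive.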
