
Koala++'s blog

Weka Development [47]: Stacking Source Code Analysis

2010-08-20 14:32:22 | Category: Machine Learning

The following explanation is copied from the web. It is not from an authoritative paper; I copied it simply because it is clear, simple, and well organized.

Stacked generalization (or stacking) (Wolpert 1992) is a different way of combining multiple models, that introduces the concept of a meta learner. Although an attractive idea, it is less widely used than bagging and boosting. Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:

1. Split the training set into two disjoint sets.

2. Train several base learners on the first part.

3. Test the base learners on the second part.

4. Using the predictions from 3) as the inputs, and the correct responses as the outputs, train a higher level learner.

Note that steps 1) to 3) are the same as cross-validation, but instead of using a winner-takes-all approach, the base learners are combined, possibly non-linearly.
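The four steps above can be sketched in plain Java. This is a toy 1-D regression example, not Weka code: the learner names and the grid-searched convex-combination meta learner are illustrative assumptions, but the data flow (split, train bases on part 1, predict on part 2, fit the meta level on those predictions) is exactly the procedure described.

```java
import java.util.Arrays;

// Minimal base-learner abstraction for the sketch.
interface Learner {
    void train(double[] x, double[] y);
    double predict(double x);
}

class StackingSketch {
    // Base learner 1: always predicts the training mean (like Weka's ZeroR).
    static class MeanLearner implements Learner {
        double mean;
        public void train(double[] x, double[] y) { mean = Arrays.stream(y).average().orElse(0); }
        public double predict(double x) { return mean; }
    }

    // Base learner 2: simple proportional model y = w * x (least squares, no intercept).
    static class SlopeLearner implements Learner {
        double w;
        public void train(double[] x, double[] y) {
            double num = 0, den = 0;
            for (int i = 0; i < x.length; i++) { num += x[i] * y[i]; den += x[i] * x[i]; }
            w = den == 0 ? 0 : num / den;
        }
        public double predict(double x) { return w * x; }
    }

    // Steps 1-4: split the data in two, train the bases on part 1, predict on
    // part 2, and fit the meta level (here: a single mixing weight a in [0,1]
    // for a*mean + (1-a)*slope) on those held-out predictions.
    static double metaWeight(double[] x, double[] y) {
        int half = x.length / 2;
        double[] x1 = Arrays.copyOfRange(x, 0, half), y1 = Arrays.copyOfRange(y, 0, half);
        double[] x2 = Arrays.copyOfRange(x, half, x.length), y2 = Arrays.copyOfRange(y, half, y.length);
        Learner[] bases = { new MeanLearner(), new SlopeLearner() };
        for (Learner l : bases) l.train(x1, y1);
        double bestA = 0, bestErr = Double.MAX_VALUE;
        for (double a = 0; a <= 1.0001; a += 0.01) {      // crude grid search as the "meta training"
            double err = 0;
            for (int i = 0; i < x2.length; i++) {
                double p = a * bases[0].predict(x2[i]) + (1 - a) * bases[1].predict(x2[i]);
                err += (p - y2[i]) * (p - y2[i]);
            }
            if (err < bestErr) { bestErr = err; bestA = a; }
        }
        return bestA;
    }
}
```

On perfectly linear data the meta level learns to put all its weight on the slope learner, which is the point of stacking: the combiner is *trained* on held-out base predictions rather than fixed in advance.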

The first part of buildClassifier has been covered many times before, so I will skip it; the important lines are the following:

// Create meta level
generateMetaLevel(newData, random);

// Rebuilt all the base classifiers on the full training data
for (int i = 0; i < m_Classifiers.length; i++) {
    getClassifier(i).buildClassifier(newData);
}

The for loop that follows simply retrains all the base classifiers on the full training set, so the most important part is the generateMetaLevel function above it.

protected void generateMetaLevel(Instances newData, Random random)
       throws Exception {
    Instances metaData = metaFormat(newData);
    m_MetaFormat = new Instances(metaData, 0);
    for (int j = 0; j < m_NumFolds; j++) {
       Instances train = newData.trainCV(m_NumFolds, j, random);

       // Build base classifiers
       for (int i = 0; i < m_Classifiers.length; i++) {
           getClassifier(i).buildClassifier(train);
       }

       // Classify test instances and add to meta data
       Instances test = newData.testCV(m_NumFolds, j);
       for (int i = 0; i < test.numInstances(); i++) {
           metaData.add(metaInstance(test.instance(i)));
       }
    }

    m_MetaClassifier.buildClassifier(metaData);
}

The gist of the code: first derive metaData's format from newData and keep it in m_MetaFormat; then split the original newData by ten-fold cross-validation into train and test sets; train m_Classifiers on train, classify each instance in test, combine its predictions with the original class value into a meta instance, and add that to metaData. Finally, train m_MetaClassifier on metaData.
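The trainCV/testCV split above partitions the data so that every instance appears in exactly one test fold. A small sketch of the fold sizing (the class name is mine, and this mirrors the usual contract where the first n % k folds get one extra instance; Weka's trainCV additionally shuffles with the supplied Random, which is omitted here):

```java
class FoldSketch {
    // Size of test fold `fold` when n instances are split into k folds:
    // base size n/k, with the first n % k folds taking one extra instance.
    static int testFoldSize(int n, int k, int fold) {
        return n / k + (fold < n % k ? 1 : 0);
    }

    // The training set for a fold is simply everything not in its test fold.
    static int trainFoldSize(int n, int k, int fold) {
        return n - testFoldSize(n, k, fold);
    }
}
```

Because every instance lands in exactly one test fold, metaData ends up with one meta instance per original training instance, each predicted by base classifiers that never saw it.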

protected Instances metaFormat(Instances instances) throws Exception {
    FastVector attributes = new FastVector();
    Instances metaFormat;

    for (int k = 0; k < m_Classifiers.length; k++) {
       Classifier classifier = (Classifier) getClassifier(k);
       String name = classifier.getClass().getName();
       if (m_BaseFormat.classAttribute().isNumeric()) {
           attributes.addElement(new Attribute(name));
       } else {
           for (int j = 0; j < m_BaseFormat.classAttribute().numValues(); j++) {
              attributes.addElement(new Attribute(name + ":"
                     + m_BaseFormat.classAttribute().value(j)));
           }
       }
    }
    attributes.addElement(m_BaseFormat.classAttribute().copy());
    metaFormat = new Instances("Meta format", attributes, 0);
    metaFormat.setClassIndex(metaFormat.numAttributes() - 1);
    return metaFormat;
}

The Instances parameter is the level-0 training set; we now need to add attributes that will hold the base classifiers' predictions. If m_BaseFormat's class attribute is numeric, m_Classifiers.length attributes are added, one per classifier; if it is nominal, each classifier contributes one attribute per level-0 class value. Finally, a copy of the class attribute is added as metaFormat's own class attribute.

Below is the result I obtained by testing with iris.arff, using a single base classifier:

@relation 'Meta format'

@attribute weka.classifiers.rules.ZeroR:Iris-setosa numeric
@attribute weka.classifiers.rules.ZeroR:Iris-versicolor numeric
@attribute weka.classifiers.rules.ZeroR:Iris-virginica numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
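The header above can be reproduced by a small sketch of the attribute-naming scheme: one attribute per (classifier, class value) pair in the nominal case, with the class attribute copied in last (the class name and the literal "class" label here are mine; Weka copies the real class attribute):

```java
import java.util.ArrayList;
import java.util.List;

class MetaFormatSketch {
    // Build the meta-level attribute names for a nominal class:
    // for each base classifier, one attribute per class value, then the class itself.
    static List<String> metaAttributeNames(String[] classifiers, String[] classValues) {
        List<String> names = new ArrayList<>();
        for (String c : classifiers)
            for (String v : classValues)
                names.add(c + ":" + v);   // e.g. "weka.classifiers.rules.ZeroR:Iris-setosa"
        names.add("class");               // the copied class attribute comes last
        return names;
    }
}
```

With one classifier and the three iris class values this yields four attributes, matching the ARFF header shown.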

protected Instance metaInstance(Instance instance) throws Exception {

    double[] values = new double[m_MetaFormat.numAttributes()];
    Instance metaInstance;
    int i = 0;
    for (int k = 0; k < m_Classifiers.length; k++) {
       Classifier classifier = getClassifier(k);
       if (m_BaseFormat.classAttribute().isNumeric()) {
           values[i++] = classifier.classifyInstance(instance);
       } else {
           double[] dist = classifier.distributionForInstance(instance);
           for (int j = 0; j < dist.length; j++) {
              values[i++] = dist[j];
           }
       }
    }
    values[i] = instance.classValue();
    metaInstance = new Instance(1, values);
    metaInstance.setDataset(m_MetaFormat);
    return metaInstance;
}

The values array holds the predictions: for a numeric class, each classifier's prediction is stored directly; for a nominal class, the class distribution is computed first and each class probability is appended to values. The true class value goes in the last slot, the new instance is assigned the m_MetaFormat dataset, and it is returned.
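The layout of that values array for a nominal class can be checked with a tiny sketch (plain arrays stand in for Weka's Instance; the class name is mine, but the packing order mirrors the loop above):

```java
class MetaInstanceSketch {
    // dists[k] is classifier k's probability distribution over the class values.
    // The meta instance is all distributions concatenated, then the true class index.
    static double[] metaValues(double[][] dists, double classValue) {
        int n = 0;
        for (double[] d : dists) n += d.length;
        double[] values = new double[n + 1];
        int i = 0;
        for (double[] d : dists)
            for (double p : d) values[i++] = p;
        values[i] = classValue;           // class value occupies the last slot
        return values;
    }
}
```

For one classifier over three classes this gives a length-4 vector, matching the four-attribute meta format from the iris example.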

 

 

 
