注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

Koala++'s blog

计算广告学 RTB

 
 
 

日志

 
 

Weka开发[24]——Apriori源代码分析(2)  

2009-08-04 20:15:23|  分类: 机器学习 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

    接下来就要看findLargeItemSets这个函数中的do/while循环了,m_Ls是保存所有项的一个集合,那kMinusOneSets就是上一个项集,接下来是AprioriItemSet.mergeAllItemSets,但是这个函数在执行第一次的时候,它并看不出来什么作用:

public static FastVector mergeAllItemSets(FastVector itemSets, int size,

       int totalTrans) {

 

    FastVector newVector = new FastVector();

    ItemSet result;

    int numFound, k;

 

    for (int i = 0; i < itemSets.size(); i++) {

       ItemSet first = (ItemSet) itemSets.elementAt(i);

        out: for (int j = i + 1; j < itemSets.size(); j++) {

           ItemSet second = (ItemSet) itemSets.elementAt(j);

           result = new AprioriItemSet(totalTrans);

           result.m_items = new int[first.m_items.length];

 

           // Find and copy common prefix of size 'size'

           numFound = 0;

           k = 0;

           while (numFound < size) {

              if (first.m_items[k] == second.m_items[k]) {

                  if (first.m_items[k] != -1)

                     numFound++;

                  result.m_items[k] = first.m_items[k];

              } else

                  break out;

              k++;

           }

 

           // Check difference

           while (k < first.m_items.length) {

              if ((first.m_items[k] != -1) && (second.m_items[k] != -1))

                  break;

              else {

                  if (first.m_items[k] != -1)

                     result.m_items[k] = first.m_items[k];

                  else

                     result.m_items[k] = second.m_items[k];

              }

              k++;

           }

           if (k == first.m_items.length) {

              result.m_counter = 0;

              newVector.addElement(result);

           }

       }

    }

    return newVector;

}

         这里有一个out标签,也就是break out执行后会跳到out那里接着执行,很象goto。这里的size是我们所需要的多少项的,这里first.m_items[k]要等于second.m_items[k]的原因是在相应的特征上值一定要相同才可以合并成一个新的规则,并且要注意,在注释的那一行中也提到了,就是前面部分一定要一样,大小为size。而在下面就是给result.items赋值余下的部分赋值,这里first.m_items[k]second.m_items[i]不能同时为-1,因为同时为-1又多出来一项,所以就要去除这种情况。

下面是ItermSet.getHashtable

public static Hashtable getHashtable(FastVector itemSets, int initialSize) {

 

    Hashtable hashtable = new Hashtable(initialSize);

 

    for (int i = 0; i < itemSets.size(); i++) {

       ItemSet current = (ItemSet) itemSets.elementAt(i);

       hashtable.put(current, new Integer(current.m_counter));

    }

    return hashtable;

}

         Hashtablekey是一个ItemSet,而值是它的计数。下面是ItemSet.pruneItemSet

public static FastVector pruneItemSets(FastVector toPrune,

       Hashtable kMinusOne) {

 

    FastVector newVector = new FastVector(toPrune.size());

    int help, j;

 

    for (int i = 0; i < toPrune.size(); i++) {

       ItemSet current = (ItemSet) toPrune.elementAt(i);

       for (j = 0; j < current.m_items.length; j++)

           if (current.m_items[j] != -1) {

              help = current.m_items[j];

              current.m_items[j] = -1;

              if (kMinusOne.get(current) == null) {

                  current.m_items[j] = help;

                  break;

              } else {

                  current.m_items[j] = help;

              }

           }

       if (j == current.m_items.length)

           newVector.addElement(current);

    }

    return newVector;

}

         pruneItemSets里面的前面这一段代码,主要是为了让m_item[j]这项为-1来,再用kMinusOne.get,如果为空,那么就break,如果这项中每一个(n-1)个特征值都是存在的,那么才将它放到newVector中去。似乎完全不明白这在做什么,现在看findRulesQuickly

private void findRulesQuickly() throws Exception {

 

    FastVector[] rules;

 

    // Build rules

    for (int j = 1; j < m_Ls.size(); j++) {

       FastVector currentItemSets = (FastVector) m_Ls.elementAt(j);

       Enumeration enumItemSets = currentItemSets.elements();

       while (enumItemSets.hasMoreElements()) {

           AprioriItemSet currentItemSet = (AprioriItemSet) enumItemSets

                  .nextElement();

           // AprioriItemSet currentItemSet = new

           // AprioriItemSet((ItemSet)enumItemSets.nextElement());

           rules = currentItemSet.generateRules(m_minMetric,

              m_hashtables,j + 1);

           for (int k = 0; k < rules[0].size(); k++) {

              m_allTheRules[0].addElement(rules[0].elementAt(k));

              m_allTheRules[1].addElement(rules[1].elementAt(k));

              m_allTheRules[2].addElement(rules[2].elementAt(k));

           }

       }

    }

}

         现在的问题是,如果这是第一次执行,那么我们的m_Ls大小为1,也就是什么都不执行,直接跳过去了。其中的generateRules如下:

public FastVector[] generateRules(double minConfidence,

       FastVector hashtables, int numItemsInSet) {

 

    FastVector premises = new FastVector(), consequences = new

       FastVector(), conf = new FastVector();

    FastVector[] rules = new FastVector[3], moreResults;

    AprioriItemSet premise, consequence;

    Hashtable hashtable = (Hashtable) hashtables

           .elementAt(numItemsInSet - 2);

 

    // Generate all rules with one item in the consequence.

    for (int i = 0; i < m_items.length; i++)

       if (m_items[i] != -1) {

           premise = new AprioriItemSet(m_totalTransactions);

           consequence = new AprioriItemSet(m_totalTransactions);

           premise.m_items = new int[m_items.length];

           consequence.m_items = new int[m_items.length];

           consequence.m_counter = m_counter;

 

           for (int j = 0; j < m_items.length; j++)

              consequence.m_items[j] = -1;

           System.arraycopy(m_items, 0, premise.m_items, 0,

                  m_items.length);

           premise.m_items[i] = -1;

 

           consequence.m_items[i] = m_items[i];

           premise.m_counter = ((Integer) hashtable.get(premise))

                  .intValue();

           premises.addElement(premise);

           consequences.addElement(consequence);

           conf.addElement(new Double(confidenceForRule(premise,

                  consequence)));

       }

    rules[0] = premises;

    rules[1] = consequences;

    rules[2] = conf;

    pruneRules(rules, minConfidence);

 

    // Generate all the other rules

    moreResults = moreComplexRules(rules, numItemsInSet, 1,

       minConfidence,

           hashtables);

    if (moreResults != null)

       for (int i = 0; i < moreResults[0].size(); i++) {

           rules[0].addElement(moreResults[0].elementAt(i));

           rules[1].addElement(moreResults[1].elementAt(i));

           rules[2].addElement(moreResults[2].elementAt(i));

       }

    return rules;

}

         其中比较重要的有两句就是premise.m_items[i]=-1表示我们把premise去掉一项,还将这一项做为consequence,其中的confidenceForRule没什么好讲的,就是公式:

public static double confidenceForRule(AprioriItemSet premise,

       AprioriItemSet consequence) {

 

    return (double) consequence.m_counter / (double) premise.m_counter;

}

         rules这里可以看出来,三个元素就是关联规则的左部,右部,和置信度。

public static void pruneRules(FastVector[] rules, double minConfidence) {

 

    FastVector newPremises = new FastVector(rules[0].size()),

newConsequences = new FastVector(rules[1].size()),

 newConf = new FastVector(rules[2].size());

 

    for (int i = 0; i < rules[0].size(); i++)

       if (!(((Double) rules[2].elementAt(i)).doubleValue() <

             minConfidence)) {

           newPremises.addElement(rules[0].elementAt(i));

           newConsequences.addElement(rules[1].elementAt(i));

           newConf.addElement(rules[2].elementAt(i));

       }

    rules[0] = newPremises;

    rules[1] = newConsequences;

    rules[2] = newConf;

}

         这里对规则进行裁减,如果小于minConfidence就将这个规则删除。

private final FastVector[] moreComplexRules(FastVector[] rules,

       int numItemsInSet, int numItemsInConsequence,

double minConfidence,FastVector hashtables) {

 

    AprioriItemSet newPremise;

    FastVector[] result, moreResults;

    FastVector newConsequences, newPremises = new FastVector(),

newConf = new FastVector();

    Hashtable hashtable;

 

    if (numItemsInSet > numItemsInConsequence + 1) {

       hashtable = (Hashtable) hashtables.elementAt(numItemsInSet

              - numItemsInConsequence - 2);

       newConsequences = mergeAllItemSets(rules[1],

              numItemsInConsequence - 1, m_totalTransactions);

       Enumeration enu = newConsequences.elements();

       while (enu.hasMoreElements()) {

           AprioriItemSet current = (AprioriItemSet) enu.nextElement();

           current.m_counter = m_counter;

           newPremise = subtract(current);

           newPremise.m_counter = ((Integer) hashtable.get(newPremise))

                  .intValue();

           newPremises.addElement(newPremise);

           newConf.addElement(new Double(confidenceForRule(newPremise,

                  current)));

       }

       result = new FastVector[3];

       result[0] = newPremises;

       result[1] = newConsequences;

       result[2] = newConf;

       pruneRules(result, minConfidence);

       moreResults = moreComplexRules(result, numItemsInSet,

              numItemsInConsequence + 1, minConfidence, hashtables);

       if (moreResults != null)

           for (int i = 0; i < moreResults[0].size(); i++) {

              result[0].addElement(moreResults[0].elementAt(i));

              result[1].addElement(moreResults[1].elementAt(i));

              result[2].addElement(moreResults[2].elementAt(i));

           }

       return result;

    } else

       return null;

}

         这里的mergeAllItemSets把所有numItemInConsequnce个的项得到了,然后对得到的newConsequences进行循环,下面有一个subtract

public final AprioriItemSet subtract(AprioriItemSet toSubtract) {

 

    AprioriItemSet result = new AprioriItemSet(m_totalTransactions);

 

    result.m_items = new int[m_items.length];

 

    for (int i = 0; i < m_items.length; i++)

       if (toSubtract.m_items[i] == -1)

           result.m_items[i] = m_items[i];

       else

           result.m_items[i] = -1;

    result.m_counter = 0;

    return result;

}

         也就是如果这里是当toSubtract中的不为-1的值,将值换为-1,如果toSubtract中相应的值为-1,那么就保留原值。下面有一个递归的过程,就是将numItemsInConsequence+1,也就是Consequence中的项数可以是numItemsInConsequence+1个了。而它的值不能大于numTermsInSet。最后在generateRules把两个结果合并了。

         接下来则应该是对规则结果进行排序,先是对support进行排序,再对confidence进行排序,这里之所以要supports[j-i]这样反着写是因为stableSort结果是升序的,stable的意思是稳定,懒地解释了,不懂看一下数据结构吧:

// Sort rules according to their support

int j = m_allTheRules[2].size() - 1;

supports = new double[m_allTheRules[2].size()];

for (int i = 0; i < (j + 1); i++)

    supports[j - i] = ((double) ((ItemSet) m_allTheRules[1]

           .elementAt(j - i)).support()) * (-1);

indices = Utils.stableSort(supports);

for (int i = 0; i < (j + 1); i++) {

    sortedRuleSet[0].addElement(m_allTheRules[0]

           .elementAt(indices[j - i]));

    sortedRuleSet[1].addElement(m_allTheRules[1]

           .elementAt(indices[j - i]));

    sortedRuleSet[2].addElement(m_allTheRules[2]

           .elementAt(indices[j - i]));

    if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {

       sortedRuleSet[3].addElement(m_allTheRules[3]

              .elementAt(indices[j - i]));

       sortedRuleSet[4].addElement(m_allTheRules[4]

              .elementAt(indices[j - i]));

       sortedRuleSet[5].addElement(m_allTheRules[5]

              .elementAt(indices[j - i]));

    }

}

 

// Sort rules according to their confidence

m_allTheRules[0].removeAllElements();

m_allTheRules[1].removeAllElements();

m_allTheRules[2].removeAllElements();

if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {

    m_allTheRules[3].removeAllElements();

    m_allTheRules[4].removeAllElements();

    m_allTheRules[5].removeAllElements();

}

confidences = new double[sortedRuleSet[2].size()];

int sortType = 2 + m_metricType;

 

for (int i = 0; i < sortedRuleSet[2].size(); i++)

    confidences[i] = ((Double) sortedRuleSet[sortType].elementAt(i))

           .doubleValue();

indices = Utils.stableSort(confidences);

for (int i = sortedRuleSet[0].size() - 1; (i >= (sortedRuleSet[0]

       .size() - m_numRules))

       && (i >= 0); i--) {

    m_allTheRules[0].addElement(sortedRuleSet[0]

           .elementAt(indices[i]));

    m_allTheRules[1].addElement(sortedRuleSet[1]

           .elementAt(indices[i]));

    m_allTheRules[2].addElement(sortedRuleSet[2]

           .elementAt(indices[i]));

    if (m_metricType != CONFIDENCE || m_significanceLevel != -1) {

       m_allTheRules[3].addElement(sortedRuleSet[3]

              .elementAt(indices[i]));

       m_allTheRules[4].addElement(sortedRuleSet[4]

              .elementAt(indices[i]));

       m_allTheRules[5].addElement(sortedRuleSet[5]

              .elementAt(indices[i]));

    }

}

         关于对类别规则的分析,我现在用不着,也就没分析。

 

  评论这张
 
阅读(2140)| 评论(4)
推荐 转载

历史上的今天

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017