
Koala++'s blog


EM for Gaussian mixtures  

2009-11-21 12:29:54 | Category: Machine Learning


Here I only give a rough outline of the treatment in Pattern Recognition and Machine Learning.

The EM in Weka is essentially the EM algorithm combined with k-means, so at first I never expected it to be in the clusterers package, because what I had usually heard of was EM combined with naïve Bayes. The EM algorithm in Weka assumes that the data are generated by a Gaussian mixture model, that is, by a linear combination of several Gaussian distributions, which can be written as:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

Here π_k is the mixing coefficient in front of the k-th Gaussian, the term following it is the Gaussian density, and K is the number of Gaussian components.

The log of the likelihood function for this model is:

$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} \qquad (9.14)$$
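Below is a minimal sketch of how this log likelihood can be evaluated, written in Python with NumPy/SciPy rather than Weka; the function name gmm_log_likelihood and the toy parameters are mine, chosen only for illustration.

```python
# A minimal sketch (not Weka's implementation) of the Gaussian mixture
# log likelihood: ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, sigmas):
    N, K = X.shape[0], len(pis)
    dens = np.zeros((N, K))
    for k in range(K):
        # weighted density of component k at every data point
        dens[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
    return np.sum(np.log(dens.sum(axis=1)))

# Toy 2-component mixture in 2-D (illustrative values only).
pis = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal(mus[k], sigmas[k], size=50) for k in (0, 1)])
print(gmm_log_likelihood(X, pis, mus, sigmas))
```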

This function raises three issues:

1.       If the mean of one Gaussian component is exactly equal to one of the samples in the data set, that point contributes a term to the likelihood function of the form:

$$\mathcal{N}(\mathbf{x}_n \mid \mathbf{x}_n, \sigma_j^2 \mathbf{I}) = \frac{1}{(2\pi)^{1/2}} \, \frac{1}{\sigma_j}$$

As the variance approaches 0, the log likelihood function (the expression above) also goes to infinity. Thus the maximization of the log likelihood function is not a well posed problem because such singularities will always be present and will occur whenever one of the Gaussian components 'collapses' onto a specific data point. This is also easy to see intuitively: if a Gaussian is determined by a single point, its mean is that point and its variance is 0. (A small numerical sketch of this collapse is given after this list.)

 

2.       A further issue in finding maximum likelihood solutions arises from the fact that for any given maximum likelihood solution, a K-component mixture will have a total of K! equivalent solutions corresponding to the K! ways of assigning K sets of parameters to K components. In other words, for any given (nondegenerate) point in the space of parameter values there will be a further K! − 1 additional points all of which give rise to exactly the same distribution. This problem is known as identifiability (Casella and Berger, 2002) and is an important issue when we wish to interpret the parameter values discovered by a model. Identifiability will also arise when we discuss models having continuous latent variables in Chapter 12. I have not yet fully understood why this is a problem. It means that the K sets of parameters μ (mean) and Σ (covariance) (I find it easier to also think of π, the mixing coefficient) can be assigned to the K components in K! different ways; this is what is called identifiability.

3.       Maximizing the log likelihood function (9.14) for a Gaussian mixture model turns out to be a more complex problem than for the case of a single Gaussian. The difficulty arises from the presence of the summation over k that appears inside the logarithm in (9.14), so that the logarithm function no longer acts directly on the Gaussian. If we set the derivatives of the log likelihood to zero, we will no longer obtain a closed form solution, as we shall see shortly. Equation (9.14) is the second formula I listed above. The book explains that this is a more complex problem than the single-Gaussian case, because the summation over k sits inside the logarithm, so the logarithm cannot act directly on the Gaussian (whose core is an exponential with base e). If we set the derivative of the log likelihood to zero, we no longer obtain a closed-form solution.
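Here is the small numerical sketch of issue 1 promised above: a 1-D, two-component mixture in which one component's mean sits exactly on a data point and its variance is shrunk towards 0, so the log likelihood grows without bound. The numbers are made up purely for illustration.

```python
# Numerical sketch of the singularity: component 1 "collapses" onto x_n = 0.0.
import numpy as np

def log_gauss_1d(x, mu, var):
    # log density of a 1-D Gaussian
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

x = np.array([-1.0, 0.0, 2.0])          # tiny 1-D data set
pi1, pi2 = 0.5, 0.5
mu2, var2 = 1.0, 1.0                    # a "healthy" second component

for var1 in [1.0, 1e-2, 1e-6, 1e-12]:   # shrink the collapsing component's variance
    comp1 = pi1 * np.exp(log_gauss_1d(x, mu=0.0, var=var1))
    comp2 = pi2 * np.exp(log_gauss_1d(x, mu=mu2, var=var2))
    ll = np.sum(np.log(comp1 + comp2))  # the mixture log likelihood keeps growing
    print(f"var1={var1:g}  log-likelihood={ll:.2f}")
```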

[Figure: (a) samples labelled by the component that generated them; (b) the same samples without labels; (c) the soft assignments (responsibilities) produced by the E step]

The starting point of the EM algorithm is that if, as in panel (a) of the figure above, each data point were already labelled with the component that generated it, the problem would be much simpler, and the log likelihood would take the form below:

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$

At a glance, the logarithm now acts directly on the Gaussian, so the problem is greatly simplified. In the problem we actually have to solve, however, the data are not labelled with the component that generated them; it is like panel (b), where we do not know which point came from which component. The E step of EM produces a result like panel (c): for each point, an estimate of which component generated it.
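A minimal sketch of that E step, assuming the standard responsibility formula γ(z_nk) = π_k N(x_n|μ_k, Σ_k) / Σ_j π_j N(x_n|μ_j, Σ_j); the parameter values are illustrative and not taken from the book's figure.

```python
# E step sketch: soft assignment of each point to each component.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, sigmas):
    K = len(pis)
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
        for k in range(K)
    ])                                    # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([[0.1, -0.2], [2.9, 3.1], [1.5, 1.5]])
pis = np.array([0.5, 0.5])
mus = [np.zeros(2), np.full(2, 3.0)]
sigmas = [np.eye(2), np.eye(2)]
print(e_step(X, pis, mus, sigmas))        # the middle point gets a "purple" ~50/50 split
```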

[Figure: illustration of the EM algorithm, panels (a)–(f), showing the initial configuration, the first E and M steps, and the results after 2, 5, and 20 complete EM cycles]

The figure above illustrates the EM algorithm. Plot (a) shows the data points in green, together with the initial configuration of the mixture model in which the one standard-deviation contours for the two Gaussian components are shown as blue and red circles. Plot (b) shows the result of the initial E step, in which each data point is depicted using a proportion of blue ink equal to the posterior probability of having been generated from the blue component, and a corresponding proportion of red ink given by the posterior probability of having been generated by the red component. Thus, points that have a significant probability for belonging to either cluster appear purple. The situation after the first M step is shown in plot (c), in which the mean of the blue Gaussian has moved to the mean of the data set, weighted by the probabilities of each data point belonging to the blue cluster, in other words it has moved to the centre of mass of the blue ink. Similarly, the covariance of the blue Gaussian is set equal to the covariance of the blue ink. Analogous results hold for the red component. Plots (d), (e), and (f) show the results after 2, 5, and 20 complete cycles of EM, respectively. In plot (f) the algorithm is close to convergence. In short: (a) shows the data in green with the one-standard-deviation contours of the two components in red and blue; (b) is the initial E step, where each point is coloured by the posterior probability of coming from the red or blue component, uncertain points appearing purple; in (c), after the first M step, the blue Gaussian's mean has moved to the weighted centre of the blue points and its covariance is set to the covariance of the blue points, and likewise for red; (d), (e), and (f) show the results after 2, 5, and 20 iterations, with (f) close to convergence.

         Note that the EM algorithm takes many more iterations to reach (approximate) convergence compared with the K-means algorithm, and that each cycle requires significantly more computation. It is therefore common to run the K-means algorithm in order to find a suitable initialization for a Gaussian mixture model that is subsequently adapted using EM. The covariance matrices can conveniently be initialized to the sample covariances of the clusters found by the K-means algorithm, and the mixing coefficients can be set to the fractions of data points assigned to the respective clusters. As with gradient-based approaches for maximizing the log likelihood, techniques must be employed to avoid singularities of the likelihood function in which a Gaussian component collapses onto a particular data point. It should be emphasized that there will generally be multiple local maxima of the log likelihood function, and that EM is not guaranteed to find the largest of these maxima. Because the EM algorithm for Gaussian mixtures plays such an important role, we summarize it below.

Note that, compared with k-means, EM needs many more iterations to converge (approximately), and each iteration requires more computation, so it is common to run K-means first to find an initialization for the Gaussian mixture model. The covariance matrices can conveniently be initialized to the sample covariances of the clusters found by K-means, and the mixing coefficients can be set to the fractions of samples assigned to the respective clusters. As with gradient-based approaches to maximizing the log likelihood, techniques must be used to avoid the singularities of the likelihood function. It should also be stressed that the log likelihood usually has multiple local maxima, and EM is not guaranteed to find the global maximum. Because the EM algorithm for Gaussian mixtures is so important, it is summarized below:

[Algorithm box: EM for Gaussian mixtures — initialize μ_k, Σ_k, π_k; E step: evaluate the responsibilities; M step: re-estimate the parameters using the responsibilities; evaluate the log likelihood and check for convergence]
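The following is a compact sketch of that whole loop, including the K-means initialization recommended in the quoted passage; it uses SciPy's kmeans2 as a stand-in for Weka's internal k-means, and the function fit_gmm (and its regularization term reg, a simple guard against the singularities mentioned earlier) is my own construction, not Weka's API.

```python
# EM for a Gaussian mixture, initialized from k-means (illustrative sketch).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.cluster.vq import kmeans2

def fit_gmm(X, K, n_iter=100, tol=1e-6, reg=1e-6):
    N, D = X.shape
    # Initialization from K-means: means, per-cluster covariances, cluster fractions
    # (assumes every initial cluster receives at least two points).
    mus, labels = kmeans2(X, K, minit='++')
    sigmas = np.array([np.cov(X[labels == k].T) + reg * np.eye(D) for k in range(K)])
    pis = np.array([np.mean(labels == k) for k in range(K)])

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(K)
        ])
        ll = np.sum(np.log(dens.sum(axis=1)))
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)

        # Convergence check on the log likelihood
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, sigmas, ll

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([4, 4], 0.7, (100, 2))])
pis, mus, sigmas, ll = fit_gmm(X, K=2)
print("mixing coefficients:", pis, " final log likelihood:", ll)
```

The convergence test here is the simple one from the summary (stop when the log likelihood stops increasing); the small reg added to each covariance is only one of several possible guards against a component collapsing onto a point.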

Frankly, at first I did not quite see what EM has to do with K-means; I just felt that it was meant for clustering and, after all that, it did not seem like much. Thinking about it more carefully, though, EM actually obtains more information than K-means. The book says: Comparison of the K-means algorithm with the EM algorithm for Gaussian mixtures shows that there is a close similarity. Whereas the K-means algorithm performs a hard assignment of data points to clusters, in which each data point is associated uniquely with one cluster, the EM algorithm makes a soft assignment based on the posterior probabilities. That is, K-means and EM are very similar; the difference is that K-means makes a hard assignment of each data point to a single cluster, so every point belongs to exactly one cluster, while EM computes the probability of each point belonging to each cluster.

Finally, from the EM equations one can derive (in the appropriate limit) the objective that K-means optimizes, from which its mean-update formula follows:

$$\mathbb{E}_{\mathbf{Z}}\!\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})\right] \;\rightarrow\; -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2 + \text{const}$$

This expression is very similar to the distortion measure used in K-means, and when maximizing it the trailing const and the factor 1/2 can be dropped, so the two objectives are the same.
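As a last sketch, this illustrates the limit the book uses to make that connection (covariances fixed to εI with ε → 0): as ε shrinks, the responsibilities harden into the 0/1 assignments of K-means. The data points and means below are arbitrary toy values.

```python
# Responsibilities harden into K-means assignments as the shared variance shrinks.
import numpy as np

def responsibilities(X, mus, eps):
    # With Sigma_k = eps*I and equal mixing coefficients, gamma(z_nk) is a
    # softmax of -||x_n - mu_k||^2 / (2*eps).
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # squared distances (N, K)
    logits = -d2 / (2 * eps)
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.2, 0.1], [1.8, 2.2]])
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
for eps in [10.0, 1.0, 0.1, 0.001]:
    print(f"eps={eps:g}\n{responsibilities(X, mus, eps).round(3)}")
```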

 
