<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Machine Learning on Salfo Bikienga</title>
    <link>/tags/machine-learning/</link>
    <description>Recent content in Machine Learning on Salfo Bikienga</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2017 Salfo Bikienga</copyright>
    <lastBuildDate>Wed, 01 Nov 2017 00:00:00 +0000</lastBuildDate>
    <atom:link href="/tags/machine-learning/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Introduction to LDA</title>
      <link>/post/introduction-to-lda/</link>
      <pubDate>Wed, 01 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/introduction-to-lda/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;An important development in text analytics is the invention of the Latent Dirichlet Allocation (LDA) algorithm (also called topic modeling) in 2003. LDA is a non-negative matrix factorization algorithm. A matrix factorization consists of decomposing a matrix into a product of two or more matrices. These linear algebra techniques turn out to have applications in data analysis, generally referred to as data dimension reduction methods. Examples of matrix factorization methods in statistics include Factor Analysis, Principal Component Analysis, and Latent Dirichlet Allocation. They are all latent variable models, which consist of using observed variables to infer the values of unobserved (or hidden) variables. The basic idea of these methods is to find &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}\)&lt;/span&gt; (two sets of hidden variables) from &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V}\)&lt;/span&gt;, the set of observed variables, such that: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}*\phi_{K,V}\]&lt;/span&gt; Where &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; is the number of observations, &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt; is the number of variables, and &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; is the number of latent variables. We want &lt;span class=&#34;math inline&#34;&gt;\(K&amp;lt;&amp;lt;V\)&lt;/span&gt;, and “hopefully” we can infer a meaning for each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt; from each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; rows of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}\)&lt;/span&gt;. 
Also, most of the information about the observations (rows of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;) contained in &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V}\)&lt;/span&gt; turns out to be captured in the reduced matrix &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt;, hence the idea of data dimension reduction. A major challenge in data dimension reduction is deciding on the appropriate value for &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt;.&lt;/p&gt;
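&lt;p&gt;To make the factorization concrete, here is a minimal numerical sketch (the matrices are randomly generated for illustration, not taken from any dataset): two small probability matrices playing the roles of theta and phi are multiplied to form a D-by-V matrix W.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, K = 100, 6, 2  # observations, variables, latent variables (K much smaller than V)

# Hidden matrices: each row of theta and each row of phi is a probability vector.
theta = rng.dirichlet(np.ones(K), size=D)  # D x K
phi = rng.dirichlet(np.ones(V), size=K)    # K x V

# The "observed" matrix is (approximately) their product.
W = theta @ phi                            # D x V
print(W.shape)
```

&lt;p&gt;Because each row of theta and each row of phi sums to one, each row of W does too; in LDA those rows are the word proportions of a document.&lt;/p&gt;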
&lt;p&gt;To help fix ideas, let’s assume we have the exam scores of 100 students on the following subjects: Gaelic, English, History, Arithmetic, Algebra, Geometry (this is not a text data example, but it is a good one to illustrate the idea of latent variable models). The dataset is &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V} = W_{100,6}\)&lt;/span&gt;; that is, 100 observations and 6 variables. Let’s assume we want to collapse the &lt;span class=&#34;math inline&#34;&gt;\(V = 6\)&lt;/span&gt; variables into &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt; variables. Let’s further assume that the first variable may be termed “Humanities”, and the second variable may be termed “Math” (a sensible assumption!). Thus, we want to create a &lt;span class=&#34;math inline&#34;&gt;\(\theta_{100,2}\)&lt;/span&gt; matrix that captures most of the information about the students’ grades on the 6 subjects. With the two variables, humanities and math, we can quickly learn about the students with the help of, for example, a simple scatterplot. The &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix helps us infer the meanings of the columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; as humanities and math because (hopefully) one row has big coefficients for Gaelic, English, History, and small coefficients for Arithmetic, Algebra, Geometry; and the second row has big coefficients for Arithmetic, Algebra, Geometry, and small coefficients for Gaelic, English, History. I hope this example provides an intuition of what matrix factorization aims to achieve when used for data analysis. The goal is to reduce the dimension of the data, i.e. to reduce the number of variables. The meaning of each new variable is inferred by guessing a name associated with the original variables that have the highest coefficients for that new variable. In the future, I will provide a numerical example within the context of Factor Analysis. 
Factor analysis is a building block for understanding latent variable models.&lt;/p&gt;
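&lt;p&gt;The exam-score example can be simulated. The sketch below fabricates scores driven by two hidden abilities, then recovers a rank-2 factorization with the singular value decomposition (one concrete matrix factorization, closely related to the PCA mentioned above; all numbers are invented for illustration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Hypothetical hidden abilities: one "humanities" and one "math" score per student.
humanities = rng.normal(60, 10, n)
math_ability = rng.normal(60, 10, n)
noise = lambda: rng.normal(0, 3, n)

# Six observed subjects, each driven mainly by one hidden ability.
W = np.column_stack([
    humanities + noise(),    # Gaelic
    humanities + noise(),    # English
    humanities + noise(),    # History
    math_ability + noise(),  # Arithmetic
    math_ability + noise(),  # Algebra
    math_ability + noise(),  # Geometry
])

# Rank-2 approximation: centered W is close to theta (100 x 2) times phi (2 x 6).
U, s, Vt = np.linalg.svd(W - W.mean(axis=0), full_matrices=False)
theta = U[:, :2] * s[:2]  # the two new "factor" variables per student
phi = Vt[:2]              # loadings linking the factors to the six subjects
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
```

&lt;p&gt;The two columns of theta summarize the six subjects, and the rows of phi show which subjects load on which factor; that loading pattern is how one would justify naming the factors “humanities” and “math”.&lt;/p&gt;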
&lt;p&gt;In LDA, the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix is a matrix of word counts, the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix is a matrix of topic proportions within each document, and the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix is a matrix of each word’s relative importance for each topic.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;lda-the-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;LDA: the model&lt;/h1&gt;
&lt;p&gt;This section provides a mathematical exposition of topic modeling and presents the data generative process used to estimate the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrices. LDA is a generative model that represents documents as being generated by a random mixture over latent variables called topics &lt;span class=&#34;citation&#34;&gt;(David M. Blei, Ng, and Jordan 2003)&lt;/span&gt;. A topic is defined as a distribution over words. For a given corpus (a collection of documents) of D documents, each of length &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt;, the generative process for LDA is defined as follows:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;For each topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, draw a distribution over words &lt;span class=&#34;math inline&#34;&gt;\(\phi_k \sim Dirichlet(\beta)\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(k = \{1, 2, ...K\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;&lt;p&gt;Draw a vector of topic proportions &lt;span class=&#34;math inline&#34;&gt;\(\theta_d \sim Dirichlet(\alpha)\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each word &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&#34;list-style-type: lower-roman&#34;&gt;
&lt;li&gt;&lt;p&gt;Draw a topic assignment &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n} \sim multinomial(\theta_d)\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n} \in \{1, 2, ..., K\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Draw a word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n} \sim multinomial(\phi_{k = z_{d,n}})\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n} \in \{1, 2, ..., V\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note: Only the words &lt;span class=&#34;math inline&#34;&gt;\(w\)&lt;/span&gt; are observed.&lt;/p&gt;
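&lt;p&gt;The generative process above can be simulated step by step. The sketch below draws a tiny synthetic corpus (all sizes and hyperparameter values are made up for illustration):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
D, K, V, N = 5, 3, 20, 50  # documents, topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1     # symmetric Dirichlet hyperparameters

# 1. For each topic, draw a distribution over words.
phi = rng.dirichlet(np.full(V, beta), size=K)   # K x V
docs = []
for d in range(D):
    # 2a. Draw the document's topic proportions.
    theta_d = rng.dirichlet(np.full(K, alpha))  # length K
    # 2b.i. Draw a topic assignment for each word position.
    z = rng.choice(K, size=N, p=theta_d)
    # 2b.ii. Draw each word from its assigned topic's word distribution.
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    docs.append(w)  # only w would be observed in practice
```

&lt;p&gt;Inference runs this process in reverse: given only the word ids in docs, recover theta and phi.&lt;/p&gt;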
&lt;p&gt;The above generative process allows us to construct an explicit closed-form expression for the joint likelihood of the observed and hidden variables. Markov Chain Monte Carlo (MCMC) and Variational Bayes methods can then be used to estimate the parameters &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; (see &lt;span class=&#34;citation&#34;&gt;David M. Blei, Ng, and Jordan (2003)&lt;/span&gt;; &lt;span class=&#34;citation&#34;&gt;David M. Blei (2012)&lt;/span&gt; for further exposition of the method). We derive the posterior distribution of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;s and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;s in the next section.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;deriving-the-theta-and-phi-values&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Deriving the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; values&lt;/h1&gt;
&lt;p&gt;A topic &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}\)&lt;/span&gt; is a distribution over V unique words, each having a proportion &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k,v}\)&lt;/span&gt;; i.e., &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k,v}\)&lt;/span&gt; is the relative importance of the word v for the definition (or interpretation) of the topic k. It is assumed that:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\phi_{k}\sim Dirichlet_{V}(\beta)\]&lt;/span&gt; That is: &lt;span class=&#34;math display&#34;&gt;\[p(\phi_{k}|\beta)=\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(B(\beta)=\frac{\prod_{v=1}^{V}\Gamma(\beta_{v})}{\Gamma(\sum_{v=1}^{V}\beta_{v})}\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\beta=(\beta_{1},...,\beta_{V})\)&lt;/span&gt;. Since we have K independent topics (by assumption), &lt;span class=&#34;math display&#34;&gt;\[p(\phi|\beta)=\prod_{k=1}^{K}\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;A document d is a distribution over K topics, each having a proportion &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d,k}\)&lt;/span&gt;, i.e. &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d,k}\)&lt;/span&gt; is the relative importance of the topic k, in the document d. We assume:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\theta_{d}\sim Dirichlet_{K}(\alpha)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(\theta_{d}|\alpha)=\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;And since we have D independent documents (by assumption),&lt;span class=&#34;math display&#34;&gt;\[p(\theta|\alpha)=\prod_{d=1}^{D}\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It is further assumed that &lt;span class=&#34;math inline&#34;&gt;\(\beta_{v}=\beta\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\alpha_{k}=\alpha\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Let &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt; be the latent topic assignment variable, i.e. the random variable &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; assigns the word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; to the topic k in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; is a vector of zeros with a 1 at the &lt;span class=&#34;math inline&#34;&gt;\(k^{th}\)&lt;/span&gt; position &lt;span class=&#34;math inline&#34;&gt;\((z_{d,n}=[0,0,...1,0,..])\)&lt;/span&gt;. Define &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n,k}=I(z_{d,n}=k)\)&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; is an indicator function that equals 1 when &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; is the topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; otherwise. We assume:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[z_{d,n}\sim Multinomial(\theta_{d})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is: &lt;span class=&#34;math display&#34;&gt;\[p(z_{d,n}=k|\theta_{d})=\theta_{d,k}, \quad \text{i.e.} \quad p(z_{d,n}|\theta_{d})=\prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;A document is assumed to have &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt; independent words, and since we assume D independent documents, we have:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(z|\theta)   =\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{n=1}^{N_{d}}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{v=1}^{V}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt; is the count of the word v in document d.&lt;/p&gt;
&lt;p&gt;The word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is drawn from the topic’s words distribution &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[w_{d,n}|\phi_{k=z_{d,n,k}}\sim Multinomial(\phi_{k=z_{d,n}})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(w_{d,n}=v|\phi_{k=z_{d,n}})=\phi_{k,v}, \quad \text{i.e.} \quad p(w_{d,n}|\phi,z_{d,n})=\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is a vector of zeros with a 1 at the &lt;span class=&#34;math inline&#34;&gt;\(v^{th}\)&lt;/span&gt; position. Define &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n,v}=I(w_{d,n}=v)\)&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; is an indicator function that equals &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; when &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; otherwise.&lt;/p&gt;
&lt;p&gt;There are D independent documents, each having &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt; independent words, so: &lt;span class=&#34;math display&#34;&gt;\[p(w|\phi)=\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(w|\phi)=\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The joint distribution of the observed words w and unobserved (or hidden variables) &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[P(w,z,\theta,\phi|\alpha,\beta)=p(\theta|\alpha)p(z|\theta)p(w|\phi,z)p(\phi|\beta)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The goal is to get the posterior distribution of the unobserved variables: &lt;span class=&#34;math display&#34;&gt;\[p(z,\theta,\phi|w,\alpha,\beta)=\frac{P(w,z,\theta,\phi|\alpha,\beta)}{\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi\)&lt;/span&gt; is intractable, so approximation methods are used to approximate the posterior distribution. The seminal LDA paper &lt;span class=&#34;citation&#34;&gt;(David M. Blei, Ng, and Jordan 2003)&lt;/span&gt; uses Mean Field Variational Bayes (an optimization method) to approximate the posterior distribution (see &lt;span class=&#34;citation&#34;&gt;Bishop (2006)&lt;/span&gt;, p. 462 or &lt;span class=&#34;citation&#34;&gt;David M Blei, Kucukelbir, and McAuliffe (2017)&lt;/span&gt; for an exposition of the theory of the variational method). The mean field variational inference uses the following approximation: &lt;span class=&#34;math display&#34;&gt;\[p(z,\theta,\phi|w,\alpha,\beta)\simeq q(z,\theta,\phi)=q(z)q(\theta)q(\phi)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;From &lt;span class=&#34;citation&#34;&gt;Bishop (2006)&lt;/span&gt;, [p. 466], we have: &lt;span class=&#34;math display&#34;&gt;\[q^{*}(z)\propto exp\left\{ E_{\theta,\phi}\left[log(p(z|\theta))+log(p(w|\phi,z))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\theta)\propto exp\left\{ E_{z,\phi}\left[log(p(\theta|\alpha))+log(p(z|\theta))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\phi)\propto exp\left\{ E_{\theta,z}\left[log(p(\phi|\beta))+log(p(w|\phi,z))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Using the expressions above, we have:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[log(q^{*}(z)) \propto E_{\theta,\phi}\left[\sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(log(\theta_{d,k})+log(\phi_{k,v})\right)\right]\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[\propto   \sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(E(log(\theta_{d,k}))+E(log(\phi_{k,v}))\right)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Note that &lt;span class=&#34;math display&#34;&gt;\[x|p\sim Multinomial_{K}(p)\iff log\left(p(x|p)\right)=\sum_{k=1}^{K}x_{k}log(p_{k})\]&lt;/span&gt;, and let’s define &lt;span class=&#34;math inline&#34;&gt;\(log(p_{k})=E(log(\theta_{d,k})+E(log(\phi_{k,v}))\)&lt;/span&gt;, so &lt;span class=&#34;math inline&#34;&gt;\(p_{k}=exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\)&lt;/span&gt;. Thus,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(z)\propto\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\left[exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\right]^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is, &lt;span class=&#34;math display&#34;&gt;\[z_{d,v}|w_{d},\theta_{d},\phi_{k}\sim Multinomial_{K}(p_{k})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;and by the multinomial properties,&lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})=p_{k}=exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\theta) \propto exp\left\{ E_{z}\left[\sum_{d}\sum_{k}(\alpha-1)log(\theta_{d,k})+\sum_{d}\sum_{k}\sum_{v}n_{d,v}*z_{d,v,k}log(\theta_{d,k})\right]\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d}^{D}\prod_{k=1}^{K}exp\left\{ (\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1)log(\theta_{d,k})\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\theta_{d,k}^{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thus, the approximate posterior distribution of the topics distribution in a document d is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\theta_{d}|w_{d},\alpha\sim Dirichlet_{K}(\tilde{\alpha}_{d})\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\alpha}_{d}=\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})\)&lt;/span&gt;. Note that &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\alpha}_{d}\)&lt;/span&gt; is a K-dimensional vector.&lt;/p&gt;
&lt;p&gt;By the properties of the Dirichlet distribution, the expected value of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d}|\tilde{\alpha}_{d}\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[E(\theta_{d}|\tilde{\alpha_{d}})=\frac{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})}{\sum_{k=1}^{K}[\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})]}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The numerical estimation of &lt;span class=&#34;math inline&#34;&gt;\(E(\theta_{d}|\tilde{\alpha}_{d})\)&lt;/span&gt; gives the estimates of the topic proportions within each document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\((\hat\theta_{d})\)&lt;/span&gt;. It is worth noting that &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})\)&lt;/span&gt; can be interpreted as the responsibility that topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; takes for explaining the observation of the word v in document d. Ignoring for a moment the denominator of the equation above, &lt;span class=&#34;math inline&#34;&gt;\(E(\theta_{d,k}|\tilde{\alpha}_{d,k})\)&lt;/span&gt; is similar to a regression equation where &lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt; are the observed counts of words in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})\)&lt;/span&gt; are the parameter estimates (or weights) of the words. This illustrates that the importance of a topic in a document is due to the high presence of words &lt;span class=&#34;math inline&#34;&gt;\((n_{d,v})\)&lt;/span&gt; referring to that topic, and to the weights of these words &lt;span class=&#34;math inline&#34;&gt;\((E(z_{d,v,k}))\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Similarly,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\phi)   \propto exp\left\{ E_{z}\left[\sum_{k=1}^{K}\sum_{v=1}^{V}(\beta-1)log(\phi_{k,v})+\sum_{d=1}^{D}\sum_{k=1}^{K}\sum_{v=1}^{V}n_{d,v}*z_{d,v,k}log(\phi_{k,v})\right]\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{k=1}^{K}\prod_{v=1}^{V}exp\left\{ (\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1)log(\phi_{k,v})\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{k=1}^{K}\prod_{v=1}^{V}\phi_{k,v}^{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thus, the approximate posterior distribution of the word distribution in topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}|w,\beta\sim Dirichlet_{V}(\tilde{\beta_{k}})\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\beta_{k}}=\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,.,k})\)&lt;/span&gt;. Note that &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\beta}_{k}\)&lt;/span&gt; is a V-dimensional vector.&lt;/p&gt;
&lt;p&gt;And the expected value of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}|\tilde{\beta}_{k}\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[  
E(\phi_{k}|\tilde{\beta_{k}})=\frac{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,.,k})}{\sum_{v=1}^{V}(\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k}))} 
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The numerical estimation of &lt;span class=&#34;math inline&#34;&gt;\(E(\phi_{k}|\tilde{\beta}_{k})\)&lt;/span&gt; gives the estimates of the words’ relative importance for each topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\((\hat\phi_{k})\)&lt;/span&gt;. Ignoring the denominator in the equation above, &lt;span class=&#34;math inline&#34;&gt;\(E(\phi_{k,v}|\tilde{\beta_{k,v}})\)&lt;/span&gt; is the weighted sum of the frequencies of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in each of the documents &lt;span class=&#34;math inline&#34;&gt;\((n_{d,v})\)&lt;/span&gt;, the weights being the responsibility topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; takes for explaining the observation of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt; &lt;span class=&#34;math inline&#34;&gt;\((E(z_{d,v,k}))\)&lt;/span&gt;.&lt;/p&gt;
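&lt;p&gt;The updates derived above suggest a simple coordinate-ascent loop: alternate between updating E(z) from the current expected log theta and log phi, and updating the Dirichlet parameters alpha-tilde and beta-tilde from E(z). The sketch below is a minimal illustration of that loop, not a production implementation; the corpus and hyperparameter values are made up, and the digamma function needed for the Dirichlet expectations of log theta and log phi is approximated by a finite difference of the log-gamma function.&lt;/p&gt;

```python
import math
import numpy as np

def digamma(x):
    # Derivative of log Gamma, approximated by a central difference of math.lgamma.
    h = 1e-5
    lg = np.vectorize(math.lgamma)
    return (lg(x + h) - lg(x - h)) / (2 * h)

def lda_cavi(n_dv, K, alpha=0.1, beta=0.1, iters=50, seed=0):
    """Mean-field updates for LDA on a D x V matrix of word counts n_dv."""
    rng = np.random.default_rng(seed)
    D, V = n_dv.shape
    Ez = rng.dirichlet(np.ones(K), size=(D, V))  # E(z_{d,v,k}), shape D x V x K
    for _ in range(iters):
        # Dirichlet parameter updates: alpha-tilde and beta-tilde from the derivation.
        a_tilde = alpha + np.einsum('dv,dvk->dk', n_dv, Ez)  # D x K
        b_tilde = beta + np.einsum('dv,dvk->kv', n_dv, Ez)   # K x V
        # Dirichlet expectations of log theta and log phi use the digamma function.
        Elog_theta = digamma(a_tilde) - digamma(a_tilde.sum(axis=1, keepdims=True))
        Elog_phi = digamma(b_tilde) - digamma(b_tilde.sum(axis=1, keepdims=True))
        # E(z) is proportional to exp(E(log theta) + E(log phi)), normalized over topics.
        Ez = np.exp(Elog_theta[:, None, :] + Elog_phi.T[None, :, :])
        Ez = Ez / Ez.sum(axis=-1, keepdims=True)
    theta_hat = a_tilde / a_tilde.sum(axis=1, keepdims=True)  # posterior mean of theta_d
    phi_hat = b_tilde / b_tilde.sum(axis=1, keepdims=True)    # posterior mean of phi_k
    return theta_hat, phi_hat

# Toy corpus: three documents over a six-word vocabulary.
counts = np.array([[5, 4, 3, 0, 0, 0],
                   [0, 0, 1, 6, 5, 4],
                   [4, 5, 0, 0, 1, 0]], dtype=float)
theta_hat, phi_hat = lda_cavi(counts, K=2)
```

&lt;p&gt;The two returned matrices mirror the two expectation formulas above: theta_hat holds the estimated topic proportions per document, and phi_hat the estimated word distributions per topic.&lt;/p&gt;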
&lt;p&gt;Here, we have derived the posterior expected values of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;s and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;s using the word counts &lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt;, which is slightly different from &lt;span class=&#34;citation&#34;&gt;David M. Blei, Ng, and Jordan (2003)&lt;/span&gt;. Posterior formulae similar to the solution derived here can be found in &lt;span class=&#34;citation&#34;&gt;Murphy (2012)&lt;/span&gt;, p. 962.&lt;/p&gt;
&lt;p&gt;In sum, the rows of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}=\left[E(\phi_{k}|\tilde{\beta}_{k})\right]_{K,V}\)&lt;/span&gt; are useful for interpreting (or identifying) the themes, whose relative importance in each document is represented by the columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}=\left[E(\theta_{d}|\tilde{\alpha}_{d})\right]_{D,K}\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Practical tools for estimating the topic distributions of a corpus exist (see &lt;span class=&#34;citation&#34;&gt;Grun and Hornik (2011)&lt;/span&gt;; &lt;span class=&#34;citation&#34;&gt;Silge and Robinson (2017, Chap. 6)&lt;/span&gt;).&lt;/p&gt;
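&lt;p&gt;For readers working in Python rather than R, scikit-learn ships a variational implementation of LDA. A minimal sketch on a made-up toy corpus (the document texts and parameter choices are purely illustrative):&lt;/p&gt;

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "gaelic english history essay poem",
    "algebra geometry arithmetic proof equation",
    "history essay gaelic poem english",
    "equation proof algebra arithmetic geometry",
]

# Build the D x V word-count matrix n_dv.
vectorizer = CountVectorizer()
n_dv = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta_hat = lda.fit_transform(n_dv)  # D x K topic proportions per document
phi_hat = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # K x V
```

&lt;p&gt;In R, the topicmodels package cited above plays the same role.&lt;/p&gt;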
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bishop2006&#34;&gt;
&lt;p&gt;Bishop, Christopher M. 2006. &lt;em&gt;Pattern Recognition and Machine Learning&lt;/em&gt;. Springer.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2017&#34;&gt;
&lt;p&gt;Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” &lt;em&gt;Journal of the American Statistical Association&lt;/em&gt; 112 (518): 859–77. Taylor &amp;amp; Francis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2012&#34;&gt;
&lt;p&gt;Blei, David M. 2012. “Probabilistic Topic Models.” &lt;em&gt;Commun. ACM&lt;/em&gt; 55 (4). New York, NY, USA: ACM: 77–84. doi:&lt;a href=&#34;https://doi.org/10.1145/2133806.2133826&#34;&gt;10.1145/2133806.2133826&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;J. Mach. Learn. Res.&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Grun2011&#34;&gt;
&lt;p&gt;Grun, Bettina, and Kurt Hornik. 2011. “Topicmodels: An R Package for Fitting Topic Models.” &lt;em&gt;Journal of Statistical Software, Articles&lt;/em&gt; 40 (13): 1–30. doi:&lt;a href=&#34;https://doi.org/10.18637/jss.v040.i13&#34;&gt;10.18637/jss.v040.i13&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Murphy2012&#34;&gt;
&lt;p&gt;Murphy, Kevin P. 2012. &lt;em&gt;Machine Learning: A Probabilistic Perspective&lt;/em&gt;. MIT press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Silge2017&#34;&gt;
&lt;p&gt;Silge, J., and D. Robinson. 2017. &lt;em&gt;Text Mining with R: A Tidy Approach&lt;/em&gt;. O’Reilly Media, Incorporated. &lt;a href=&#34;https://books.google.com/books?id=7bQzMQAACAAJ&#34; class=&#34;uri&#34;&gt;https://books.google.com/books?id=7bQzMQAACAAJ&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Coming Soon</title>
      <link>/project/coming-soon/</link>
      <pubDate>Fri, 13 Oct 2017 00:00:00 +0000</pubDate>
      
      <guid>/project/coming-soon/</guid>
      <description></description>
    </item>
    
  </channel>
</rss>
