<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Topic Modeling on Salfo Bikienga</title>
    <link>/categories/topic-modeling/</link>
    <description>Recent content in Topic Modeling on Salfo Bikienga</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2017 Salfo Bikienga</copyright>
    <lastBuildDate>Fri, 17 Nov 2017 00:00:00 +0000</lastBuildDate>
    <atom:link href="/categories/topic-modeling/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Topic modeling: The Intuition</title>
      <link>/post/topic-modeling-the-intuition/</link>
      <pubDate>Fri, 17 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-the-intuition/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Whenever I give a talk on topic modeling to people not familiar with the subject, the usual question I receive is: “can you provide some intuition behind topic modeling?” Another variant of the same question is: “This is magic. How can the computer identify the topics in the documents?” No! It is not magic. It is math. I presented the math behind &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;Latent Dirichlet Allocation&lt;/a&gt;, and an &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/topic-modeling-an-application/&#34; target=&#34;_blank&#34;&gt;example application&lt;/a&gt;, in previous posts. Here is my attempt at providing the intuition, from the perspective of someone with a basic understanding of simple linear regression and a bit of matrix algebra.&lt;br /&gt;
Topic modeling is a form of matrix factorization. Though modern topic modeling algorithms involve complex probability theory, the basic intuition can be developed through simple matrix factorization.&lt;br /&gt;
Matrix factorization can be understood as a form of data dimension reduction. In a world of “big data”, the usefulness of such a method is immense. For instance, linear regression, the most used statistical tool in economics, is only applicable when &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is at least as big as &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt;, the number of variables. When &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; is too big, we resort to some dimension reduction method such as choosing a few variables based on theory, or using &lt;a href=&#34;https://en.wikipedia.org/wiki/Stepwise_regression&#34; target=&#34;_blank&#34;&gt;stepwise&lt;/a&gt; or &lt;a href=&#34;https://en.wikipedia.org/wiki/Lasso_(statistics)&#34; target=&#34;_blank&#34;&gt;LASSO&lt;/a&gt; regression. With matrix factorization, we do not have to select variables. We can just “redefine” the variables in a lower dimensional space, that is, convert the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; dimensional data into &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; dimensional data, where &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; is significantly smaller than &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(k&amp;lt;&amp;lt;p\)&lt;/span&gt;). The question is, how does that make sense? It is just matrix algebra, as you will see very soon.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-idea-of-dimension-reduction-from-matrix-factorization-perspective&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The idea of dimension reduction from matrix factorization perspective&lt;/h1&gt;
&lt;p&gt;1- Consider measures of &lt;code&gt;length&lt;/code&gt;, &lt;code&gt;width&lt;/code&gt;, and &lt;code&gt;depth&lt;/code&gt;. These are three variables, i.e. three dimensional data. If size is all the information we care about, then &lt;code&gt;volume&lt;/code&gt;, that is, &lt;code&gt;Volume = length x width x depth&lt;/code&gt;, is a good variable. Thus, we can collapse the three variables (&lt;code&gt;length&lt;/code&gt;, &lt;code&gt;width&lt;/code&gt;, &lt;code&gt;depth&lt;/code&gt;) into a single variable, &lt;code&gt;volume&lt;/code&gt;, and preserve the essential information needed.&lt;/p&gt;
&lt;p&gt;2- Consider measures of &lt;code&gt;height&lt;/code&gt;, &lt;code&gt;weight&lt;/code&gt;, and &lt;code&gt;waist&lt;/code&gt;. These are three variables, i.e. three dimensional data. If size provides enough information for what we need, then some form of linear combination of the three variables (&lt;span class=&#34;math inline&#34;&gt;\(size = b_1\times height + b_2 \times weight + b_3 \times waist\)&lt;/span&gt;) will do. Thus, we collapse the three dimensional data into one dimensional data.&lt;/p&gt;
&lt;p&gt;3- Consider a dataset of word counts in several documents. Let’s consider the following words: &lt;code&gt;college&lt;/code&gt;, &lt;code&gt;drugs&lt;/code&gt;, &lt;code&gt;education&lt;/code&gt;, &lt;code&gt;graduation&lt;/code&gt;, &lt;code&gt;health&lt;/code&gt;, &lt;code&gt;medicaid&lt;/code&gt;. This is six dimensional data. If what we care about are the concepts of education and health care, then some form of linear combination of these word counts will do. Our task consists of finding the appropriate weights so that a document having higher counts of education related words than other words gets a high value for the education concept, and a low value for the health concept; and a document having higher counts of health related words than other words gets a high value for the health concept, and a low value for the education concept. Thus, we reduce the six dimensional data into two dimensional data, while preserving the essential information we care about. A small sketch of this example follows.&lt;/p&gt;
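&lt;p&gt;To make example 3 concrete, here is a minimal R sketch with made-up counts and hand-picked (hypothetical) weights; a topic modeling algorithm would instead estimate such weights from the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Word counts for one hypothetical document
counts &amp;lt;- c(college = 5, drugs = 0, education = 7,
            graduation = 3, health = 1, medicaid = 0)
w_edu    &amp;lt;- c(1, 0, 1, 1, 0, 0) # hypothetical weights for the education concept
w_health &amp;lt;- c(0, 1, 0, 0, 1, 1) # hypothetical weights for the health concept
c(education = sum(w_edu * counts), health = sum(w_health * counts))
# education = 15, health = 1: the document leans heavily toward education&lt;/code&gt;&lt;/pre&gt;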
&lt;/div&gt;
&lt;div id=&#34;the-idea-of-matrix-factorization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The idea of matrix factorization&lt;/h1&gt;
The idea of matrix factorization stems from the fact that any matrix can be decomposed into the product of two or more matrices. Let &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p}\)&lt;/span&gt; be a data matrix with &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; rows and &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; columns. We can write the same matrix as the product of two matrices, such as:
&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
W_{n,p} \simeq Z_{n,k}B_{k,p}
\label{eq:fac1}
\end{equation}\]&lt;/span&gt;
&lt;p&gt;It turns out that &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; preserves the essential information needed to understand variations between the &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; observations in &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;.&lt;/p&gt;
&lt;div id=&#34;illustrative-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Illustrative example&lt;/h2&gt;
&lt;p&gt;Let &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p}\)&lt;/span&gt; be a spreadsheet of word counts in &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; documents. &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; is the number of unique words, and the words can be seen as variables. Here, &lt;span class=&#34;math inline&#34;&gt;\(n=6\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(p = 5\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(k = 2\)&lt;/span&gt;. Let &lt;code&gt;college&lt;/code&gt;, &lt;code&gt;education&lt;/code&gt;, &lt;code&gt;family&lt;/code&gt;, &lt;code&gt;health&lt;/code&gt;, and &lt;code&gt;medicaid&lt;/code&gt; be, respectively, the variable names of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[ 
\underbrace{\begin{bmatrix}
4&amp;amp;6&amp;amp;0&amp;amp;2&amp;amp;2  \\ 
0&amp;amp;0&amp;amp;4&amp;amp;8&amp;amp;12  \\ 
6&amp;amp;9&amp;amp;1&amp;amp;5&amp;amp;6 \\    
2&amp;amp;3&amp;amp;3&amp;amp;7&amp;amp;10 \\
0&amp;amp;0&amp;amp;3&amp;amp;6&amp;amp;9 \\
4&amp;amp;6&amp;amp;1&amp;amp;4&amp;amp;5 \\
    \end{bmatrix}
        }_{\mathbf{W_{6,5}}}
=
\underbrace{\begin{bmatrix} 
2&amp;amp;0  \\
0&amp;amp;4  \\
3&amp;amp;1 \\
1&amp;amp;3 \\
0&amp;amp;3 \\
2&amp;amp;1 \\
    \end{bmatrix}
        }_{\mathbf{Z_{6,2}}} 
\underbrace{\begin{bmatrix} 
2&amp;amp;3&amp;amp;0&amp;amp;1&amp;amp;1 \\
0&amp;amp;0&amp;amp;1&amp;amp;2&amp;amp;3 \\             
    \end{bmatrix}
        }_{\mathbf{B_{2,5}}} 
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It is easy to check that the product holds, that is, &lt;span class=&#34;math inline&#34;&gt;\(Z_{6,2}B_{2,5} = W_{6,5}\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; contains most of the information about the observations contained in &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;. With &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, we can easily explore, or study, the variation between the observations. &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; is a two dimensional representation of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;, and a simple scatterplot can be used to explore the data, as shown in the plot below.&lt;/p&gt;
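&lt;p&gt;Before plotting, we can verify the product in R (a direct transcription of the matrices above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- matrix(c(2, 0,
             0, 4,
             3, 1,
             1, 3,
             0, 3,
             2, 1), byrow = TRUE, nrow = 6)
B &amp;lt;- matrix(c(2, 3, 0, 1, 1,
             0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
Z %*% B # reproduces the W matrix above, entry by entry&lt;/code&gt;&lt;/pre&gt;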
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- matrix(c(2, 0,
             0, 4,
             3, 1,
             1, 3,
             0, 3,
             2, 1), byrow = TRUE, nrow = 6)
Z &amp;lt;- data.frame(z1 = Z[,1], z2 = Z[, 2])
plot(x = Z$z1, y = Z$z2, cex = 3)
text(x = Z$z1, y = Z$z2, labels= 1:6, cex= 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-12-17-topic-modeling-the-intuition_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the plot, we can deduce that observations (or documents) 2, 4 and 5 are close to each other; observations 1, 6 and 3 are also close to each other. The point here is that with a reduced dimension, it is easier to draw some insight from the data. Hence, the benefit of matrix factorization for data analysis.&lt;/p&gt;
&lt;p&gt;For a &lt;strong&gt;predictive modeling&lt;/strong&gt; exercise, we replace the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix with the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; matrix, and the usual tools (linear regression, logistic regression, regression tree, etc.) can be used. We do not have to understand the meaning of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. We only care about their ability to predict.&lt;br /&gt;
However, for &lt;strong&gt;exploratory&lt;/strong&gt; and &lt;strong&gt;inferential&lt;/strong&gt; data analysis, we want to understand the meaning of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. To tell a story, we have to know what the variables mean. We infer the meaning of the new variables by inspecting the &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; matrix. I will explain why that is the case shortly. For now, note that the number of columns of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is the number of columns of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix, and the number of its rows is the number of columns of the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; matrix. Each row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is used to interpret the meaning of the corresponding column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;: row &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is used to infer the meaning of column &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. For instance, referring to our illustrative example above, row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has its biggest values in its first and second columns; that is, variables 1 and 2 (&lt;code&gt;college&lt;/code&gt; and &lt;code&gt;education&lt;/code&gt;) of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix are dominant in identifying the meaning of the first &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; variable. Likewise, row 2 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has its biggest values in its last two columns; variables 4 and 5 (&lt;code&gt;health&lt;/code&gt; and &lt;code&gt;medicaid&lt;/code&gt;) of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix are dominant in identifying the meaning of the second &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; variable. Thus, the columns of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; represent, respectively, measures of the education and health concepts, as the sketch below illustrates.&lt;/p&gt;
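&lt;p&gt;Here is a small sketch of that reading of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, using the illustrative matrices above: the largest entries in each row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; point to the words that dominate the corresponding column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B &amp;lt;- matrix(c(2, 3, 0, 1, 1,
             0, 0, 1, 2, 3), byrow = TRUE, nrow = 2,
           dimnames = list(c(&amp;quot;Z1&amp;quot;, &amp;quot;Z2&amp;quot;),
                           c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)))
# Top two words per row of B
apply(B, 1, function(w) names(sort(w, decreasing = TRUE))[1:2])
# Z1: education, college -- Z2: medicaid, health&lt;/code&gt;&lt;/pre&gt;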
&lt;/div&gt;
&lt;div id=&#34;finding-z-and-b&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finding &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;There are several matrix factorization algorithms: Factor Analysis (FA), Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF), Probabilistic Latent Semantic Analysis (PLSA) and its variants, etc. Since our goal for this introduction is to present the basic idea, let’s present an algorithm that is closer to something we are all familiar with: Ordinary Least Squares (OLS).&lt;/p&gt;
&lt;div id=&#34;multivariate-ols&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Multivariate OLS&lt;/h3&gt;
&lt;p&gt;From introductory statistics, we know that for: &lt;span class=&#34;math display&#34;&gt;\[y_{n,1} = X_{n,p}\beta_{p,1} + \epsilon_{n,1}\]&lt;/span&gt; the least squares solution for &lt;span class=&#34;math inline&#34;&gt;\(\beta_{p,1}\)&lt;/span&gt; is: &lt;span class=&#34;math display&#34;&gt;\[\hat\beta_{p,1} = (X^tX)^{-1}X^ty\]&lt;/span&gt; We are assuming that &lt;span class=&#34;math inline&#34;&gt;\((X^tX)^{-1}\)&lt;/span&gt; exists. &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; stands for transpose, and &lt;span class=&#34;math inline&#34;&gt;\(-1\)&lt;/span&gt; stands for inverse.&lt;/p&gt;
&lt;p&gt;In case you do not remember this formula, recall that: &lt;span class=&#34;math display&#34;&gt;\[y_{n,1} = X_{n,p}\beta_{p,1} + \epsilon_{n,1}
\Leftrightarrow 
X^tY = (X^tX)\beta + X^t\epsilon\]&lt;/span&gt; Under the assumptions of no correlation between &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\epsilon\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(E(\epsilon) = 0\)&lt;/span&gt;, we can set &lt;span class=&#34;math inline&#34;&gt;\(X^t\epsilon=0\)&lt;/span&gt;. So we have: &lt;span class=&#34;math display&#34;&gt;\[X^tY = (X^tX)\beta \\
\Leftrightarrow \\
(X^tX)^{-1}X^tY = (X^tX)^{-1}(X^tX)\beta \\
\Rightarrow \\
\hat\beta = (X^tX)^{-1}X^tY
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;For more than a single left hand side variable &lt;span class=&#34;math inline&#34;&gt;\(y_{n,1}\)&lt;/span&gt;, the same formula applies, and we have: &lt;span class=&#34;math display&#34;&gt;\[\hat B = (X^tX)^{-1}X^tY\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is a &lt;span class=&#34;math inline&#34;&gt;\(p\times q\)&lt;/span&gt; matrix, and &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; is an &lt;span class=&#34;math inline&#34;&gt;\(n \times q\)&lt;/span&gt; matrix.&lt;/p&gt;
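&lt;p&gt;As a quick sanity check on this formula, here is a short R sketch with simulated (made-up) data: the normal-equations solution matches the coefficients that &lt;code&gt;lm()&lt;/code&gt; returns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
n &amp;lt;- 100; p &amp;lt;- 3; q &amp;lt;- 2
X &amp;lt;- matrix(rnorm(n * p), nrow = n)
Y &amp;lt;- X %*% matrix(c(1, 2, 3, -1, 0, 4), nrow = p) + matrix(rnorm(n * q), nrow = n)
B_hat &amp;lt;- solve(t(X) %*% X) %*% t(X) %*% Y # (X^t X)^{-1} X^t Y
all.equal(unname(B_hat), unname(coef(lm(Y ~ X - 1)))) # TRUE&lt;/code&gt;&lt;/pre&gt;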
&lt;/div&gt;
&lt;div id=&#34;multivariate-ols-and-matrix-factorization&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Multivariate OLS and matrix factorization&lt;/h3&gt;
What does multivariate regression have to do with matrix factorization? Note that, ignoring the &lt;span class=&#34;math inline&#34;&gt;\(\epsilon\)&lt;/span&gt;, we could have written:
&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
Y_{n,q} \simeq X_{n,p}B_{p,q}
\label{eq:ols}
\end{equation}\]&lt;/span&gt;
&lt;p&gt;This equation is very similar to the equation &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p} \simeq Z_{n,k}B_{k,p}\)&lt;/span&gt;, except that &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; is observed in the case of multivariate OLS.&lt;/p&gt;
&lt;p&gt;In multivariate OLS, we only estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. For matrix factorization, we estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;.&lt;br /&gt;
From &lt;span class=&#34;math inline&#34;&gt;\(W \simeq ZB\)&lt;/span&gt;, we can solve for &lt;span class=&#34;math display&#34;&gt;\[\hat B = (Z^tZ)^{-1}Z^tW\]&lt;/span&gt; or &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WB^t(BB^t)^{-1}\]&lt;/span&gt; The predicted value of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; is: &lt;span class=&#34;math display&#34;&gt;\[\hat W = \hat Z \hat B\]&lt;/span&gt; To estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, we need &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, and to estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; we need &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. We do not have either one. The trick is to guess some initial values for &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and use them to estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, then use the estimated &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to estimate a new &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. Use the new &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; to estimate a new &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. Continue the iteration until some stopping criterion is met. Thus, we estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; iteratively (this estimation method is known as Alternating Least Squares). When do we stop the iteration?&lt;/p&gt;
&lt;p&gt;Again, &lt;span class=&#34;math inline&#34;&gt;\(\hat W = \hat Z \hat B\)&lt;/span&gt; is the predicted value of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;. We iterate until the distance between &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; and its predicted value, &lt;span class=&#34;math inline&#34;&gt;\(\hat W\)&lt;/span&gt;, is negligible. There are several distance measures, but let’s keep things simple by using the Euclidean distance, or &lt;span class=&#34;math inline&#34;&gt;\(L_2\)&lt;/span&gt; norm: &lt;span class=&#34;math display&#34;&gt;\[Q(\hat Z, \hat B) = ||W-\hat W (\hat Z, \hat B)||_2 = \sqrt{\sum_{i = 1}^n \sum_{j = 1}^p (w_{i,j} - \hat w_{i,j})^2}\]&lt;/span&gt; Thus, we minimize &lt;span class=&#34;math inline&#34;&gt;\(Q\)&lt;/span&gt;, the objective function. Following is an example implementation of a simple alternating least squares algorithm.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;W &amp;lt;- matrix(c(4,    6,    0,    2,    2,
             0,    0,    4,    8,   12,
             6,    9,    1,    5,    6,
             2,    3,    3,    7,   10,
             0,    0,    3,    6,    9,
             2,    6,    1,    4,    5), byrow = TRUE, nrow = 6)

set.seed(3)
Z_init &amp;lt;- abs(round(rnorm(n = 6*2, mean = 0, sd = 2),0))
Z_init &amp;lt;- matrix(Z_init, nrow = 6)

Z &amp;lt;- Z_init
dist_ww &amp;lt;- 1e3
max_iter &amp;lt;- 1000
iter &amp;lt;- 0
while(iter &amp;lt;= max_iter &amp;amp;&amp;amp; dist_ww &amp;gt;= 1e-6) {
  iter &amp;lt;- iter + 1
  ZZ_inv &amp;lt;- solve(t(Z)%*%Z)
  B &amp;lt;- ZZ_inv%*%t(Z)%*%W   # update B given Z: (Z^t Z)^{-1} Z^t W
  BB_inv &amp;lt;- solve(B%*%t(B))
  Z &amp;lt;- W%*%t(B)%*%BB_inv   # update Z given B: W B^t (B B^t)^{-1}
  W_hat &amp;lt;- Z%*%B
  dist_ww &amp;lt;- sqrt(sum((W - W_hat)^2)) # L2 distance between W and W_hat
}
W &amp;lt;- data.frame(W)
names(W) &amp;lt;- c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)
Z &amp;lt;- data.frame(round(Z, 2))
row.names(Z) &amp;lt;- paste0(&amp;quot;document.&amp;quot;, 1:6)
names(Z) &amp;lt;- c(&amp;quot;Topic.1&amp;quot;, &amp;quot;Topic.2&amp;quot;)
B &amp;lt;- data.frame(round(B, 2), row.names = c(&amp;quot;Topic.1&amp;quot;, &amp;quot;Topic.2&amp;quot;))
names(B) &amp;lt;- c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Below is the table of the least squares estimate of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         college education family health medicaid
## Topic.1    1.18      1.96  -0.02    0.6     0.58
## Topic.2    0.50      0.85   1.11    2.5     3.60&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Observe that row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has high values in columns 1 and 2 compared to columns 3, 4, and 5; and row 2 has higher values for columns 4 and 5 compared to columns 1, 2, and 3. It is reasonable to infer that row 1 (&lt;code&gt;Topic.1&lt;/code&gt;) refers to education, and row 2 (&lt;code&gt;Topic.2&lt;/code&gt;) refers to health.&lt;/p&gt;
&lt;p&gt;Below is the table of the least squares estimate of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic.1 Topic.2
## document.1    3.13    0.05
## document.2   -1.55    3.58
## document.3    4.31    0.97
## document.4    0.41    2.71
## document.5   -1.16    2.68
## document.6    2.26    1.03&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Observe that &lt;code&gt;Topic.1&lt;/code&gt; has big values in documents 1, 3, and 6. Likewise, &lt;code&gt;Topic.2&lt;/code&gt; has big values in documents 2, 4, and 5. Hence, we can infer that documents 1, 3, and 6 are mostly about education; and documents 2, 4, and 5 are mostly about health.&lt;/p&gt;
&lt;p&gt;We can use a scatterplot to explore the original five dimensional &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; data through its two dimensional representation &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(x = Z$Topic.1, y = Z$Topic.2, cex = 3, 
     xlab = &amp;quot;Topic.1&amp;quot;, ylab = &amp;quot;Topic.2&amp;quot;)
text(x = Z$Topic.1, y = Z$Topic.2, labels= 1:6, cex= 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-12-17-topic-modeling-the-intuition_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;uniqueness-of-the-solution&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Uniqueness of the solution&lt;/h3&gt;
&lt;p&gt;The solution is not unique, as you might have noticed: the computed &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; differ from those of the illustrative example, even though &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; remains the same. To see why, assume &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; is an orthonormal matrix, that is, &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; is such that &lt;span class=&#34;math inline&#34;&gt;\(TT^t = I\)&lt;/span&gt;. Then, &lt;span class=&#34;math inline&#34;&gt;\(W \simeq ZB = ZTT^tB = (ZT)(T^tB) = Z^*B^*\)&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(Z^* = ZT\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(B^* = T^tB\)&lt;/span&gt;. Thus, (&lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;) and (&lt;span class=&#34;math inline&#34;&gt;\(Z^*\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(B^*\)&lt;/span&gt;) are equally valid solutions. Therefore, the solution is not unique. This non-uniqueness poses some challenges for inferential studies based on the reduced dimension.&lt;/p&gt;
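&lt;p&gt;We can check this numerically: rotating &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; by an orthonormal matrix &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; (here, a 2 by 2 rotation) changes both factors but leaves their product unchanged.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- matrix(c(2, 0, 0, 4, 3, 1, 1, 3, 0, 3, 2, 1), byrow = TRUE, nrow = 6)
B &amp;lt;- matrix(c(2, 3, 0, 1, 1, 0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
angle &amp;lt;- pi / 6 # any angle yields a valid orthonormal T
T_mat &amp;lt;- matrix(c(cos(angle), -sin(angle),
                  sin(angle),  cos(angle)), nrow = 2)
Z_star &amp;lt;- Z %*% T_mat    # a different, equally valid Z
B_star &amp;lt;- t(T_mat) %*% B # and its matching B
all.equal(Z %*% B, Z_star %*% B_star) # TRUE: same W, different factors&lt;/code&gt;&lt;/pre&gt;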
&lt;/div&gt;
&lt;div id=&#34;interpreting-the-new-variables&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Interpreting the new variables&lt;/h3&gt;
&lt;p&gt;Again, we use the rows of the &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; matrix to infer the meaning of each column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. Why? Observe that &lt;span class=&#34;math display&#34;&gt;\[\hat B = (Z^tZ)^{-1}Z^tW\]&lt;/span&gt; Let’s define &lt;span class=&#34;math inline&#34;&gt;\(F = (Z^tZ)^{-1}Z^t\)&lt;/span&gt; with elements &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt;, that is, &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt; is the value in the &lt;span class=&#34;math inline&#34;&gt;\(i^{th}\)&lt;/span&gt; row, &lt;span class=&#34;math inline&#34;&gt;\(j^{th}\)&lt;/span&gt; column of the matrix &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt;. Thus, &lt;span class=&#34;math inline&#34;&gt;\(\hat B = FW\)&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[
\hat B_{k,p}=\begin{bmatrix}b_{1,1} &amp;amp; b_{1,2} &amp;amp; \cdots &amp;amp; b_{1,p}\\
b_{2,1} &amp;amp; b_{2,2} &amp;amp; \cdots &amp;amp; b_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
b_{k,1} &amp;amp; b_{k,2} &amp;amp; \cdots &amp;amp; b_{k,p}
\end{bmatrix}
=
\begin{bmatrix}f_{1,1} &amp;amp; f_{1,2} &amp;amp; \cdots &amp;amp; f_{1,n}\\
f_{2,1} &amp;amp; f_{2,2} &amp;amp; \cdots &amp;amp; f_{2,n}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
f_{k,1} &amp;amp; f_{k,2} &amp;amp; \cdots &amp;amp; f_{k,n}
\end{bmatrix}
\begin{bmatrix}w_{1,1} &amp;amp; w_{1,2} &amp;amp; \cdots &amp;amp; w_{1,p}\\
w_{2,1} &amp;amp; w_{2,2} &amp;amp; \cdots &amp;amp; w_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
w_{n,1} &amp;amp; w_{n,2} &amp;amp; \cdots &amp;amp; w_{n,p}
\end{bmatrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;If you still remember matrix operations from high school, note that: &lt;span class=&#34;math display&#34;&gt;\[b_{1,1} = \sum_{l=1}^nf_{1,l}\times w_{l,1} \\
= f_{1,1}w_{1,1}+f_{1,2}w_{2,1}+f_{1,3}w_{3,1}+\cdots+f_{1,n}w_{n,1}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[b_{1,2} = \sum_{l=1}^nf_{1,l}\times w_{l,2} \\
= f_{1,1}w_{1,2}+f_{1,2}w_{2,2}+f_{1,3}w_{3,2}+\cdots+f_{1,n}w_{n,2}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Observe that the source of any numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(b_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(b_{1,2}\)&lt;/span&gt; is the numerical difference between the first and second columns of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; (the &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt; are exactly the same). Also, observe that, whatever &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; is, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,1}\)&lt;/span&gt; is a weighted total of the first variable &lt;span class=&#34;math inline&#34;&gt;\(W_1\)&lt;/span&gt; (say, the counts of word 1 in all the documents). Likewise, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,2}\)&lt;/span&gt; is a weighted total of the second variable &lt;span class=&#34;math inline&#34;&gt;\(W_2\)&lt;/span&gt; (say, the counts of the second word in all the documents); and so on until &lt;span class=&#34;math inline&#34;&gt;\(b_{1,p}\)&lt;/span&gt;. Put differently, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,j}\)&lt;/span&gt; is a weighted total of the word &lt;span class=&#34;math inline&#34;&gt;\(W_j\)&lt;/span&gt;. Thus, the coefficients &lt;span class=&#34;math inline&#34;&gt;\([b_{1,1},b_{1,2}, \cdots,b_{1,p}]\)&lt;/span&gt; are the total weights of the words &lt;span class=&#34;math inline&#34;&gt;\(W_1, W_2, \cdots, W_p\)&lt;/span&gt;, respectively. If these are word weights, it is natural to use the words with the highest weights to name row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. We name the remaining rows of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; in similar fashion.&lt;/p&gt;
&lt;p&gt;Also, observe that the elements of the first row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; are the coefficients of the first column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. If row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is named, say, education, then the first column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; is an education variable. Hence the naming of the columns of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-values-of-the-new-variables-z&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The values of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;Again, we have &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WB^t(BB^t)^{-1}\]&lt;/span&gt; Let’s define &lt;span class=&#34;math display&#34;&gt;\[N = B^t(BB^t)^{-1}\]&lt;/span&gt; Then &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WN\]&lt;/span&gt; That is &lt;span class=&#34;math display&#34;&gt;\[
\hat{Z} 
= 
\begin{bmatrix}
z_{1,1} &amp;amp; z_{1,2} &amp;amp; \cdots &amp;amp; z_{1,k}\\
z_{2,1} &amp;amp; z_{2,2} &amp;amp; \cdots &amp;amp; z_{2,k}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
z_{n,1} &amp;amp; z_{n,2} &amp;amp; \cdots &amp;amp; z_{n,k}
\end{bmatrix} 
=
\begin{bmatrix}
w_{1,1} &amp;amp; w_{1,2} &amp;amp; \cdots &amp;amp; w_{1,p}\\
w_{2,1} &amp;amp; w_{2,2} &amp;amp; \cdots &amp;amp; w_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
w_{n,1} &amp;amp; w_{n,2} &amp;amp; \cdots &amp;amp; w_{n,p}
\end{bmatrix}
\begin{bmatrix}
n_{1,1} &amp;amp; n_{1,2} &amp;amp; \cdots &amp;amp; n_{1,k}\\
n_{2,1} &amp;amp; n_{2,2} &amp;amp; \cdots &amp;amp; n_{2,k}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
n_{p,1} &amp;amp; n_{p,2} &amp;amp; \cdots &amp;amp; n_{p,k}
\end{bmatrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Observe that &lt;span class=&#34;math display&#34;&gt;\[z_{1,1} = \sum_{m = 1}^p n_{m,1}w_{1,m} \\
 = n_{1,1}w_{1,1}+ n_{2,1}w_{1,2}+n_{3,1}w_{1,3}+\cdots+n_{p,1}w_{1,p}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[z_{1,2} = \sum_{m = 1}^p n_{m,2}w_{1,m} \\
= n_{1,2}w_{1,1}+ n_{2,2}w_{1,2}+n_{3,2}w_{1,3}+\cdots+n_{p,2}w_{1,p}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[z_{2,1} = \sum_{m = 1}^p n_{m,1}w_{2,m} \\
= n_{1,1}w_{2,1}+ n_{2,1}w_{2,2}+n_{3,1}w_{2,3}+\cdots+n_{p,1}w_{2,p}\]&lt;/span&gt; The numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(z_{1,2}\)&lt;/span&gt; stems from the numerical difference between the weights in columns &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; of the weights matrix &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; can be seen as a weight matrix). The numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(z_{2,1}\)&lt;/span&gt; stems from the numerical difference between the word counts in documents &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; of the word counts matrix &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Alternatively, we can think of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; as a composite index matrix. &lt;span class=&#34;math inline&#34;&gt;\(z_{i,j}\)&lt;/span&gt; is the value of the index &lt;span class=&#34;math inline&#34;&gt;\(j\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt;. For example, &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; is the value of index &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;; &lt;span class=&#34;math inline&#34;&gt;\(z_{1,2}\)&lt;/span&gt; is the value of index &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;. Why different index values for the same document? Because each index assigns different weights to the same words. For index &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;, the weights are the &lt;span class=&#34;math inline&#34;&gt;\(n_{m,1}\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(m=\{1, 2,\cdots,p\}\)&lt;/span&gt;). For the index &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt;, the weights are &lt;span class=&#34;math inline&#34;&gt;\(n_{m,2}\)&lt;/span&gt;. And for the &lt;span class=&#34;math inline&#34;&gt;\(k^{th}\)&lt;/span&gt; index, the weights are &lt;span class=&#34;math inline&#34;&gt;\(n_{m,k}\)&lt;/span&gt;.&lt;/p&gt;
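&lt;p&gt;Tying this together in R with the illustrative matrices: we compute &lt;span class=&#34;math inline&#34;&gt;\(N = B^t(BB^t)^{-1}\)&lt;/span&gt;, recover &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; as &lt;span class=&#34;math inline&#34;&gt;\(WN\)&lt;/span&gt; (exactly, since here &lt;span class=&#34;math inline&#34;&gt;\(W = ZB\)&lt;/span&gt; holds exactly), and read off &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; as a weighted sum of the word counts of document 1.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;W &amp;lt;- matrix(c(4, 6, 0, 2, 2,
             0, 0, 4, 8, 12,
             6, 9, 1, 5, 6,
             2, 3, 3, 7, 10,
             0, 0, 3, 6, 9,
             4, 6, 1, 4, 5), byrow = TRUE, nrow = 6)
B &amp;lt;- matrix(c(2, 3, 0, 1, 1,
             0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
N &amp;lt;- t(B) %*% solve(B %*% t(B)) # the weight matrix N
Z_hat &amp;lt;- W %*% N                # recovers Z exactly, since W = ZB exactly
sum(W[1, ] * N[, 1])            # z_{1,1} = 2, the value of index 1 in document 1&lt;/code&gt;&lt;/pre&gt;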
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;some-variants-of-the-matrix-factorization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Some variants of the matrix factorization&lt;/h1&gt;
&lt;p&gt;1- Note that our working example data &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; is count data. Naturally, we would want &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to have non-negative values. &lt;a href=&#34;https://en.wikipedia.org/wiki/Non-negative_matrix_factorization&#34; target=&#34;_blank&#34;&gt;Non-Negative Matrix Factorization&lt;/a&gt; was invented to force the elements of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to be non-negative.&lt;/p&gt;
&lt;p&gt;2- Moreover, the algorithm presented above assumes no probability distribution. Consequently, it is inappropriate to use &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; for inferential studies (inferential studies build on probabilistic assumptions about the data generating process). Probabilistic matrix factorization algorithms address these concerns. These methods include Probabilistic Principal Component Analysis (PPCA), Multinomial Principal Component Analysis (mPCA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), etc.&lt;/p&gt;
&lt;p&gt;3- Traditional matrix factorization methods implicitly or explicitly assume a multivariate normal distribution, and decompose the covariance matrix of the data. Factor Analysis (FA) and Principal Component Analysis (PCA) are two examples.&lt;/p&gt;
&lt;p&gt;I hope this introductory exposition of topic modeling provides an intuitive understanding of the why and the how of the subject. Feel free to leave your comments below.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Topic Modeling: An Application</title>
      <link>/post/topic-modeling-an-application/</link>
      <pubDate>Sat, 11 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-an-application/</guid>
      <description>&lt;section id=&#34;introduction&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;My work involves the use and development of topic modeling algorithms. A surprising challenge I have had is communicating the output of topic modeling algorithms to people not familiar with text analytics. Here is my ten-cent explanation of the LDA output for my econ friends.&lt;/p&gt;
&lt;p&gt;The use of text data for &lt;a href=&#34;http://review.chicagobooth.edu/magazine/spring-2015/why-words-are-the-new-numbers&#34; target=&#34;_blank&#34;&gt;economic analysis&lt;/a&gt; is gaining traction. One popular analytical tool is Latent Dirichlet Allocation (LDA), also called topic modeling &lt;span class=&#34;citation&#34; data-cites=&#34;Blei2003&#34;&gt;(Blei, Ng, and Jordan 2003)&lt;/span&gt;. Succinctly put, topic modeling consists of collapsing a matrix (i.e., a spreadsheet) of word counts into a reduced matrix of topic proportions within documents. For instance, assume we have a collection of 500 documents, each containing 2000 unique words; this collection of documents (called a corpus) can be represented as a dataset of 500 observations and 2000 variables (each word being a variable). Each cell in the matrix represents the count of a word in a document. The matrix is just a regular spreadsheet of data. Clearly, it is almost impossible to draw any insight from that many variables. LDA allows us to collapse the high dimensional dataset into a lower dimension, say a dimension of 10. With 10 variables, there is hope that some insight can be drawn from the data. Following is a demonstration of LDA.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;example-data&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Example Data&lt;/h1&gt;
&lt;p&gt;Let’s consider a dataset of U.S. governors’ State of the State Addresses (SoSA). In most states, the governor gives a speech, generally in January, in which he/she lays out his/her priorities for the next fiscal year. Part of the goal of the speech is to explain (or justify) the proposed budget, and hopefully convince the state stakeholders to support the proposed budget. A budget proposal usually involves a reallocation of the state resources, which implies cuts and increases in different lines of the state budget. I collected 596 speeches from governors of the 50 states, spanning from 2001 to 2013.&lt;/p&gt;
&lt;p&gt;It is customary in text analytics to delete words that we believe are not “discriminative”. For instance, linking words such as “the”, “and”, “she”, etc. will not distinguish a Democrat from a Republican. We call this process pre-processing the data, that is, cleaning the data by removing elements of the texts that we believe are not useful for our analysis.&lt;/p&gt;
&lt;p&gt;After pre-processing the data, I am left with a dataset of 596 observations and 1034 words (or variables). You can take a look at the pre-processed data &lt;a href=&#34;http://rpubs.com/sbikienga/334137&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;, or you can download it &lt;a href=&#34;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;. Stemming, that is, stripping words down to their roots, is often done to avoid counting related words separately. For example, education, educational, and educate are all stemmed to educ.&lt;/p&gt;
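&lt;p&gt;As an illustration, here is a small stemming sketch, assuming the &lt;code&gt;SnowballC&lt;/code&gt; package (one of several stemmers available in R):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;SnowballC&amp;quot;) # run once if not installed
library(SnowballC)
wordStem(c(&amp;quot;education&amp;quot;, &amp;quot;educational&amp;quot;, &amp;quot;educate&amp;quot;), language = &amp;quot;porter&amp;quot;)
# all three are stripped down to the root &amp;quot;educ&amp;quot;&lt;/code&gt;&lt;/pre&gt;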
&lt;/section&gt;
&lt;section id=&#34;example-application-of-lda&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Example application of LDA&lt;/h1&gt;
&lt;p&gt;The goal when using LDA is primarily to reduce the dimension of a counts dataset. The hope is that the reduced dimension preserves the essential information contained in the original dataset. Interestingly, the reduced dimension is often more appropriate for statistical analysis, as it “solves” the overfitting problem associated with high dimensional data. Generally, the overfitting problem arises in situations where &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is not big enough to provide reliable estimates of the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; variables’ parameters.&lt;/p&gt;
&lt;p&gt;There are several packages in R to implement the LDA model (&lt;code&gt;lda&lt;/code&gt;, &lt;code&gt;mallet&lt;/code&gt;, and &lt;code&gt;topicmodels&lt;/code&gt;). Here I will use the &lt;code&gt;topicmodels&lt;/code&gt; package as an example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;topicmodels&amp;quot;) # You should run this code once if you don&amp;#39;t have topicmodels installed
library(topicmodels) # Load the topicmodels package
url &amp;lt;- url(&amp;quot;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&amp;quot;)
load(url) # Load the data from the url provided
SoSA_topics &amp;lt;- LDA(SoSA_data_df, # The matrix of words counts
                   k = 2, # The number of topics to construct
                   method = &amp;quot;Gibbs&amp;quot;, # Estimation method
                   control = list(iter = 3000, # Number of iterations
                                   burnin = 1000, # Throw out the first 1000 estimates
                                   seed = 123)) # To get reproducible results&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that LDA is a matrix factorization algorithm, and a matrix factorization consists of decomposing a matrix into the product of two or more matrices. Intuitively, we can write: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}\phi_{K,V}\]&lt;/span&gt;&lt;/p&gt;
&lt;section id=&#34;the-reduced-dimension-theta-matrix&#34; class=&#34;level2&#34;&gt;
&lt;h2&gt;The reduced dimension, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix&lt;/h2&gt;
&lt;p&gt;In this example, &lt;span class=&#34;math inline&#34;&gt;\(D=596\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(V=1034\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; contains the essential information needed to understand the variation between the observations (the speeches). For instance, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be used to study how Democrats differ from Republicans regarding the relative importance of the themes they cover in their speeches. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be seen as a regular spreadsheet of data, as shown below. For an extended exposition of LDA, see &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;this&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta_matrix &amp;lt;- posterior(SoSA_topics)$topics # Extract the theta matrix
theta_matrix &amp;lt;- round(as.data.frame(theta_matrix), digits = 3)
names(theta_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2, sep = &amp;quot;&amp;quot;) # Name the columns
head(theta_matrix, n = 10) # Print out the first 10 observations&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                       Topic.1 Topic.2
## Alabama_2001_D_1.txt    0.274   0.726
## Alabama_2002_D_2.txt    0.377   0.623
## Alabama_2003_R_3.txt    0.767   0.233
## Alabama_2004_R_4.txt    0.613   0.387
## Alabama_2005_R_5.txt    0.484   0.516
## Alabama_2006_R_6.txt    0.513   0.487
## Alabama_2007_R_7.txt    0.424   0.576
## Alabama_2008_R_8.txt    0.481   0.519
## Alabama_2009_R_9.txt    0.516   0.484
## Alabama_2010_R_10.txt   0.583   0.417&lt;/code&gt;&lt;/pre&gt;
&lt;/section&gt;
&lt;section id=&#34;how-do-we-know-which-themes-are-covered&#34; class=&#34;level2&#34;&gt;
&lt;h2&gt;How do we know which themes are covered?&lt;/h2&gt;
&lt;p&gt;Well, here we imposed the number of themes by setting &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. To identify the themes, we use the matrix &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;, which presents the relative importance of each word for each theme (or topic).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;phi_matrix &amp;lt;- posterior(SoSA_topics)$terms # Extract the phi matrix
phi_matrix &amp;lt;- round(phi_matrix, 3) # Round the numbers to 3 decimals
phi_matrix[, 1:20] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    abil  abus academ acceler accept access accomplish accord account
## 1 0.001 0.001  0.000       0  0.001  0.000      0.000      0   0.002
## 2 0.000 0.001  0.001       0  0.000  0.003      0.001      0   0.001
##   achiev acknowledg across action activ actual addit address adequ
## 1  0.001      0.001  0.001  0.001 0.001  0.001 0.003   0.005     0
## 2  0.002      0.000  0.002  0.001 0.001  0.000 0.001   0.001     0
##   administr adopt
## 1     0.003 0.001
## 2     0.000 0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might be more helpful to transpose the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix so that, by sorting each topic in decreasing order of the words’ relative weights, we can identify the few most important (in terms of weight) words for a given topic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;T_phi_matrix &amp;lt;- as.data.frame(t(phi_matrix))
names(T_phi_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2)
T_phi_matrix[1:20, ] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic. 1 Topic. 2
## abil          0.001    0.000
## abus          0.001    0.001
## academ        0.000    0.001
## acceler       0.000    0.000
## accept        0.001    0.000
## access        0.000    0.003
## accomplish    0.000    0.001
## accord        0.000    0.000
## account       0.002    0.001
## achiev        0.001    0.002
## acknowledg    0.001    0.000
## across        0.001    0.002
## action        0.001    0.001
## activ         0.001    0.001
## actual        0.001    0.000
## addit         0.003    0.001
## address       0.005    0.001
## adequ         0.000    0.000
## administr     0.003    0.000
## adopt         0.001    0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;terms()&lt;/code&gt; function of the &lt;code&gt;topicmodels&lt;/code&gt; package returns a convenient version of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix that replaces the word weights with the words themselves, after sorting each row of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;terms_matrix &amp;lt;- terms(SoSA_topics, 30) # Extract the first 30 most important words for each topic
terms_matrix[1:15, ] # Print out the first 15 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1   Topic 2   
##  [1,] &amp;quot;budget&amp;quot;  &amp;quot;school&amp;quot;  
##  [2,] &amp;quot;fund&amp;quot;    &amp;quot;work&amp;quot;    
##  [3,] &amp;quot;govern&amp;quot;  &amp;quot;educ&amp;quot;    
##  [4,] &amp;quot;peopl&amp;quot;   &amp;quot;help&amp;quot;    
##  [5,] &amp;quot;million&amp;quot; &amp;quot;children&amp;quot;
##  [6,] &amp;quot;work&amp;quot;    &amp;quot;make&amp;quot;    
##  [7,] &amp;quot;make&amp;quot;    &amp;quot;famili&amp;quot;  
##  [8,] &amp;quot;public&amp;quot;  &amp;quot;nation&amp;quot;  
##  [9,] &amp;quot;propos&amp;quot;  &amp;quot;busi&amp;quot;    
## [10,] &amp;quot;servic&amp;quot;  &amp;quot;creat&amp;quot;   
## [11,] &amp;quot;dollar&amp;quot;  &amp;quot;health&amp;quot;  
## [12,] &amp;quot;know&amp;quot;    &amp;quot;student&amp;quot; 
## [13,] &amp;quot;spend&amp;quot;   &amp;quot;invest&amp;quot;  
## [14,] &amp;quot;increas&amp;quot; &amp;quot;teacher&amp;quot; 
## [15,] &amp;quot;program&amp;quot; &amp;quot;care&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By exploring the most important words for each topic, it seems reasonable to infer that Topic.1 is about “money”, the budget; and Topic.2 is mostly about education.&lt;/p&gt;
&lt;p&gt;In sum, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; provides the essential information needed to understand variations or differences between observations; &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is used to infer the meaning of each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id=&#34;using-theta-for-statistical-analysis&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Using &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; for statistical analysis&lt;/h1&gt;
&lt;p&gt;What use can we make of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;? Quite a lot!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; alone, or combined with other control variables, &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;, can be used for regular statistical analysis. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; has been used for economic analyses. &lt;span class=&#34;citation&#34; data-cites=&#34;Brown2016&#34;&gt;(Brown, Crowley, and Elliott 2016)&lt;/span&gt; applied LDA to assess whether the thematic content of financial statement disclosures is informative in predicting intentional misreporting. &lt;span class=&#34;citation&#34; data-cites=&#34;Hansen2016&#34;&gt;(Hansen and McMahon 2016)&lt;/span&gt; use LDA in a Factor Augmented Vector Autoregressive modeling framework. I have a working paper exploring the relationship between US governors’ commitments to their economic agenda as stated in their public statements and the expansion of business establishments in their states &lt;span class=&#34;citation&#34; data-cites=&#34;Bikienga2017&#34;&gt;(Bikienga 2017)&lt;/span&gt;. For a survey of the use of LDA and other text analytics tools in economics, see &lt;span class=&#34;citation&#34; data-cites=&#34;Gentzkow2017&#34;&gt;(Gentzkow, Kelly, and Taddy 2017)&lt;/span&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;illustration-of-the-use-of-theta&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Illustration of the use of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;&lt;/h1&gt;
&lt;p&gt;Is there any difference between Democrats and Republicans based on the themes covered in their speeches? To answer this question, we can compute the mean values of the topics by party line. Note that D, R, or I is appended to the rownames of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix shown above; they stand for Democrat, Republican, and Independent.&lt;/p&gt;
&lt;p&gt;Here, I am using the rownames to construct additional variables (&lt;code&gt;state&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;, and &lt;code&gt;party&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)
state_vars &amp;lt;- row.names(theta_matrix) %&amp;gt;% 
  str_split(pattern = &amp;quot;_&amp;quot;) %&amp;gt;% as.data.frame() %&amp;gt;% t()
state_vars &amp;lt;- state_vars[, -4]
state_vars &amp;lt;- data.frame(state_vars)
names(state_vars) &amp;lt;- c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;party&amp;quot;)
df &amp;lt;- data.frame(theta_matrix, state_vars)
n_obs &amp;lt;- sample(1:596, size = 10)
sample_obs &amp;lt;- df[n_obs,]
sample_obs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                            Topic.1 Topic.2       state year party
## Florida_2009_R_94.txt        0.381   0.619     Florida 2009     R
## Kansas_2009_D_171.txt        0.422   0.578      Kansas 2009     D
## Maryland_2003_R_204.txt      0.435   0.565    Maryland 2003     R
## Illinois_2010_D_139.txt      0.579   0.421    Illinois 2010     D
## SouthDakota_2007_R_405.txt   0.378   0.622 SouthDakota 2007     R
## Tennessee_2002_R_411.txt     0.399   0.601   Tennessee 2002     R
## Florida_2004_R_89.txt        0.217   0.783     Florida 2004     R
## RhodeIsland_2002_R_534.txt   0.375   0.625 RhodeIsland 2002     R
## Alabama_2003_R_3.txt         0.767   0.233     Alabama 2003     R
## Minnesota_2008_R_241.txt     0.387   0.613   Minnesota 2008     R&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compute the topics’ means by party line.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(tidyr)
df_by_party &amp;lt;- df %&amp;gt;%
  group_by(party) %&amp;gt;%
summarise(Topic.1 = mean(Topic.1), Topic.2 = mean(Topic.2)) %&amp;gt;%
  gather(Topic, Topic_proportion, Topic.1:Topic.2) %&amp;gt;%
  mutate(Topic_proportion = round(100*Topic_proportion, 0)) %&amp;gt;%
  mutate(pos = c(rep(75, 3), rep(25, 3)))
df_by_party&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 4
##   party Topic   Topic_proportion   pos
##   &amp;lt;fct&amp;gt; &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 D     Topic.1              46.   75.
## 2 I     Topic.1              62.   75.
## 3 R     Topic.1              51.   75.
## 4 D     Topic.2              54.   25.
## 5 I     Topic.2              38.   25.
## 6 R     Topic.2              49.   25.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Democrats seem to talk more about education (Topic.2) than Republicans. On average, about 54% of their speeches refer to the education theme, against 49% for Republicans. Conversely, Republicans tend to talk more about budgetary issues than Democrats (51% for Republicans vs. 46% for Democrats).&lt;/p&gt;
&lt;p&gt;Clearly, these differences are not huge, and we cannot put too much stock in them. The goal here is to illustrate how one may use the topic distributions, without going into the intricacies of statistical significance.&lt;/p&gt;
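&lt;p&gt;For readers who do want a quick check, here is a minimal sketch of a two-sample t-test on the education topic, comparing Democrats and Republicans only. It ignores the panel structure of the data (repeated speeches per state), so it should be read as a rough diagnostic at best.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_DR &amp;lt;- subset(df, party %in% c(&amp;quot;D&amp;quot;, &amp;quot;R&amp;quot;)) # drop the Independent speeches
df_DR$party &amp;lt;- droplevels(df_DR$party)      # keep only the two remaining levels
t.test(Topic.2 ~ party, data = df_DR)       # education theme, by party&lt;/code&gt;&lt;/pre&gt;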
&lt;p&gt;The above table can be visualized with the help of a stacked bar plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggthemes)
library(extrafont)
#library(plyr)
#library(scales)
fill &amp;lt;- c(&amp;quot;#add8e6&amp;quot;, &amp;quot;#b87333&amp;quot;)
p_party &amp;lt;- ggplot() +
  geom_bar(aes(y = Topic_proportion, x = party, fill = Topic), 
           data = df_by_party, stat=&amp;quot;identity&amp;quot;) +
  geom_text(data=df_by_party, aes(x = party, y = pos, label = paste0(Topic_proportion,&amp;quot;%&amp;quot;)),
            colour=&amp;quot;black&amp;quot;, family=&amp;quot;Tahoma&amp;quot;, size=4) +
  theme(legend.position=&amp;quot;bottom&amp;quot;, legend.direction=&amp;quot;horizontal&amp;quot;,
        legend.title = element_blank()) +
  labs(x=&amp;quot;Political Party&amp;quot;, y=&amp;quot;Percentage&amp;quot;) +
  ggtitle(&amp;quot;Average Proportion of Topic Covered By Party (%)&amp;quot;) +
  scale_fill_manual(values=fill) +
  theme(axis.line = element_line(size=1, colour = &amp;quot;black&amp;quot;),
        panel.grid.major = element_line(colour = &amp;quot;#d3d3d3&amp;quot;), panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  theme(plot.title = element_text(size = 14, family = &amp;quot;Tahoma&amp;quot;, face = &amp;quot;bold&amp;quot;),
        text=element_text(family=&amp;quot;Tahoma&amp;quot;),
        axis.text.x=element_text(colour=&amp;quot;black&amp;quot;, size = 10),
        axis.text.y=element_text(colour=&amp;quot;black&amp;quot;, size = 10))
p_party&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-11-11-topic-modeling-an-application_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;should-we-trust-the-results&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Should we trust the results?&lt;/h1&gt;
&lt;p&gt;Yes! We should. A mental block I faced when I started exploring topic modeling was trusting the results. If your program is like mine, latent variable models are not covered in your econometrics classes, even though they are widely used in the economics literature. In Macroeconomics, they are termed Factor Augmented Vector Autoregressive models. In Development Economics, they are used to construct indices &lt;span class=&#34;citation&#34; data-cites=&#34;Berenger2007 Tabellini2010&#34;&gt;(Bérenger and Verdier-Chouchane 2007; Tabellini 2010)&lt;/span&gt;. Factor model approaches are also used as instruments &lt;span class=&#34;citation&#34; data-cites=&#34;Bai2010&#34;&gt;(Bai and Ng 2010)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;But LDA is just another factor model algorithm. It is closely related to principal component analysis (PCA). In the future, I will present the idea of factor models, and why they are “reliable”.&lt;/p&gt;
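&lt;p&gt;For the curious, here is a hedged sketch of that comparison: running PCA on the same word counts matrix (assuming &lt;code&gt;SoSA_data_df&lt;/code&gt; can be coerced to a dense matrix). The first few principal components play a role loosely analogous to the topic proportions, though without LDA’s probabilistic interpretation or non-negativity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca &amp;lt;- prcomp(as.matrix(SoSA_data_df)) # PCA on the 596 x 1034 count matrix
head(round(pca$x[, 1:2], 3))           # first two components, one row per speech&lt;/code&gt;&lt;/pre&gt;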
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In sum, topic modeling in general, and LDA in particular, is a dimension reduction method. It consists of collapsing a matrix of word counts into a reduced matrix of topic distributions. This illustration provides a sense of its usefulness for statistical analysis.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;references&#34; class=&#34;level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bai2010&#34;&gt;
&lt;p&gt;Bai, Jushan, and Serena Ng. 2010. “Instrumental Variable Estimation in a Data Rich Environment.” &lt;em&gt;Econometric Theory&lt;/em&gt; 26 (6). Cambridge University Press: 1577–1606.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Berenger2007&#34;&gt;
&lt;p&gt;Bérenger, Valérie, and Audrey Verdier-Chouchane. 2007. “Multidimensional Measures of Well-Being: Standard of Living and Quality of Life Across Countries.” &lt;em&gt;World Development&lt;/em&gt; 35 (7). Elsevier: 1259–76.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Bikienga2017&#34;&gt;
&lt;p&gt;Bikienga, Salfo. 2017. “The Governor as the Entrepreneur in Chief: An Exploratory Analysis.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;J. Mach. Learn. Res.&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Brown2016&#34;&gt;
&lt;p&gt;Brown, Nerissa C, Richard M Crowley, and W Brooke Elliott. 2016. “What Are You Saying? Using Topic to Detect Financial Misreporting.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Gentzkow2017&#34;&gt;
&lt;p&gt;Gentzkow, Matthew, Bryan T Kelly, and Matt Taddy. 2017. “Text as Data.” National Bureau of Economic Research.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Hansen2016&#34;&gt;
&lt;p&gt;Hansen, Stephen, and Michael McMahon. 2016. “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” &lt;em&gt;Journal of International Economics&lt;/em&gt; 99. Elsevier: S114–S133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Tabellini2010&#34;&gt;
&lt;p&gt;Tabellini, Guido. 2010. “Culture and Institutions: Economic Development in the Regions of Europe.” &lt;em&gt;Journal of the European Economic Association&lt;/em&gt; 8 (4). Oxford University Press: 677–716.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
</description>
    </item>
    
    <item>
      <title>Topic Modeling: An Application</title>
      <link>/post/topic-modeling-an-application/</link>
      <pubDate>Sat, 11 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-an-application/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;My work involves the use and development of topic modeling algorithms. A surprising challenge I have had is communicating the output of topic modeling algorithms to people not familiar with text analytics. Here is my ten-cent explanation of the LDA output for my econ friends.&lt;/p&gt;
&lt;p&gt;The use of text data for &lt;a href=&#34;http://review.chicagobooth.edu/magazine/spring-2015/why-words-are-the-new-numbers&#34; target=&#34;_blank&#34;&gt;economic analysis&lt;/a&gt; is gaining traction. One popular analytical tool is Latent Dirichlet Allocation (LDA), also called topic modeling &lt;span class=&#34;citation&#34;&gt;(Blei, Ng, and Jordan 2003)&lt;/span&gt;. Succinctly put, topic modeling consists of collapsing a matrix (i.e., a spreadsheet) of word counts into a reduced matrix of topic proportions within documents. For instance, assume we have a collection of 500 documents (called a corpus) with 2000 unique words across them; this corpus can be represented as a dataset of 500 observations and 2000 variables (each word being a variable). Each cell in the matrix is the count of a word in a document, so the matrix is just a regular spreadsheet of data. Clearly, it is almost impossible to draw any insight from that many variables. LDA allows us to collapse the high dimensional dataset into a lower dimension, say a dimension of 10. With 10 variables, there is hope that some insight can be drawn from the data. Following is a demonstration of LDA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example Data&lt;/h1&gt;
&lt;p&gt;Let’s consider a dataset of U.S. governors’ State of the State Addresses (SoSA). In most states, the governor gives a speech, generally in January, in which he/she lays out his/her priorities for the next fiscal year. Part of the goal of the speech is to explain (or justify) the proposed budget, and hopefully to convince the state’s stakeholders to support it. A budget proposal usually involves a reallocation of state resources, which implies cuts and increases across different lines of the state budget. I collected 596 speeches from governors of the 50 states, spanning 2001 to 2013.&lt;/p&gt;
&lt;p&gt;It is customary in text analytics to delete words that we believe are not “discriminative”. For instance, link words such as “the”, “and”, “she”, etc. will not distinguish a Democrat from a Republican. We call this process pre-processing the data: cleaning the data by removing elements of the texts that we believe are not useful for our analysis.&lt;/p&gt;
&lt;p&gt;After pre-processing the data, I am left with a dataset of 596 observations and 1034 words (or variables). You can take a look at the pre-processed data &lt;a href=&#34;http://rpubs.com/sbikienga/334137&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;, or you can download it &lt;a href=&#34;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;. Stemming, that is, stripping words down to their roots, is often done to avoid counting related words separately. For example, education, educational, and educate are all stemmed to educ. A minimal sketch of such a pipeline follows.&lt;/p&gt;
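&lt;p&gt;To make the pre-processing step concrete, here is a minimal sketch of a typical pipeline using the &lt;code&gt;tm&lt;/code&gt; and &lt;code&gt;SnowballC&lt;/code&gt; packages (the two example sentences are made up, and the actual pre-processing of the SoSA corpus may have differed in its details):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(c(&amp;quot;tm&amp;quot;, &amp;quot;SnowballC&amp;quot;)) # if needed
library(tm)        # text pre-processing utilities
library(SnowballC) # Porter stemmer used by stemDocument()
docs &amp;lt;- c(&amp;quot;Education and educational programs educate our children.&amp;quot;,
          &amp;quot;The budget funds our schools and teachers.&amp;quot;)
corpus &amp;lt;- VCorpus(VectorSource(docs))
corpus &amp;lt;- tm_map(corpus, content_transformer(tolower)) # lower-case
corpus &amp;lt;- tm_map(corpus, removePunctuation)            # drop punctuation
corpus &amp;lt;- tm_map(corpus, removeWords, stopwords(&amp;quot;en&amp;quot;)) # drop link words
corpus &amp;lt;- tm_map(corpus, stemDocument)                 # educ, school, ...
dtm &amp;lt;- DocumentTermMatrix(corpus) # the matrix of word counts, W
inspect(dtm)&lt;/code&gt;&lt;/pre&gt;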
&lt;/div&gt;
&lt;div id=&#34;example-application-of-lda&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example application of LDA&lt;/h1&gt;
&lt;p&gt;The goal when using LDA is primarily to reduce the dimension of a counts dataset. The hope is that the reduced dimension preserves the essential information contained in the original dataset. Interestingly, the reduced dimension is often more appropriate for statistical analysis, as it “solves” the overfitting problem associated with high dimensional data. Generally, the overfitting problem arises in situations where &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is not big enough to provide reliable estimates of the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; variables’ parameters.&lt;/p&gt;
&lt;p&gt;There are several packages in R to implement the LDA model (&lt;code&gt;lda&lt;/code&gt;, &lt;code&gt;mallet&lt;/code&gt;, and &lt;code&gt;topicmodels&lt;/code&gt;). Here I will use the &lt;code&gt;topicmodels&lt;/code&gt; package as an example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;topicmodels&amp;quot;) # You should run this code once if you don&amp;#39;t have topicmodels installed
library(topicmodels) # Load the topicmodels package
url &amp;lt;- url(&amp;quot;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&amp;quot;)
load(url) # Load the data from the url provided
SoSA_topics &amp;lt;- LDA(SoSA_data_df, # The matrix of words counts
                   k = 2, # The number of topics to construct
                   method = &amp;quot;Gibbs&amp;quot;, # Estimation method
                   control = list(iter = 3000, # Number of iterations
                                   burnin = 1000, # Throw out the first 1000 draws
                                   seed = 123)) # For reproducible results&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that LDA is a matrix factorization algorithm, and a matrix factorization consists of decomposing a matrix into the product of two or more matrices. Intuitively, we can write: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}\phi_{K,V}\]&lt;/span&gt;&lt;/p&gt;
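&lt;p&gt;One way to see the factorization at work is to multiply the two estimated matrices back together and compare the product with the observed word shares. A quick sketch, using the &lt;code&gt;posterior()&lt;/code&gt; extractor introduced just below (the match is approximate by construction):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta &amp;lt;- posterior(SoSA_topics)$topics # D x K matrix
phi &amp;lt;- posterior(SoSA_topics)$terms    # K x V matrix
W_hat &amp;lt;- theta %*% phi                 # D x V; each row sums to 1
W_obs &amp;lt;- as.matrix(SoSA_data_df) / rowSums(as.matrix(SoSA_data_df))
round(W_hat[1, 1:3], 4) # reconstructed shares of 3 words, document 1
round(W_obs[1, 1:3], 4) # observed shares, for comparison&lt;/code&gt;&lt;/pre&gt;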
&lt;div id=&#34;the-reduced-dimension-theta-matrix&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The reduced dimension, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix&lt;/h2&gt;
&lt;p&gt;In this example, &lt;span class=&#34;math inline&#34;&gt;\(D=596\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(V=1034\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; contains the essential information needed to understand the variation across speeches. For instance, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be used to study how Democrats differ from Republicans in the relative importance of the themes they cover in their speeches. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be viewed as a regular spreadsheet of data, as shown below. For an extended exposition of LDA, see &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;this&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta_matrix &amp;lt;- posterior(SoSA_topics)$topics # Extract the theta matrix
theta_matrix &amp;lt;- round(as.data.frame(theta_matrix), digits = 3)
names(theta_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2, sep = &amp;quot;&amp;quot;) # Name the columns
head(theta_matrix, n = 10) # Print out the first 10 observations&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                       Topic.1 Topic.2
## Alabama_2001_D_1.txt    0.274   0.726
## Alabama_2002_D_2.txt    0.377   0.623
## Alabama_2003_R_3.txt    0.767   0.233
## Alabama_2004_R_4.txt    0.613   0.387
## Alabama_2005_R_5.txt    0.484   0.516
## Alabama_2006_R_6.txt    0.513   0.487
## Alabama_2007_R_7.txt    0.424   0.576
## Alabama_2008_R_8.txt    0.481   0.519
## Alabama_2009_R_9.txt    0.516   0.484
## Alabama_2010_R_10.txt   0.583   0.417&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;how-do-we-know-which-themes-are-covered&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How do we know which themes are covered?&lt;/h2&gt;
&lt;p&gt;Well, here we imposed the number of themes by setting &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. To identify the themes, we use the matrix &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;, which presents the relative importance of each word for each theme (or topic).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;phi_matrix &amp;lt;- posterior(SoSA_topics)$terms # Extract the phi matrix
phi_matrix &amp;lt;- round(phi_matrix, 3) # Round the numbers to 3 decimals
phi_matrix[, 1:20] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    abil  abus academ acceler accept access accomplish accord account
## 1 0.001 0.001  0.000       0  0.001  0.000      0.000      0   0.002
## 2 0.000 0.001  0.001       0  0.000  0.003      0.001      0   0.001
##   achiev acknowledg across action activ actual addit address adequ
## 1  0.001      0.001  0.001  0.001 0.001  0.001 0.003   0.005     0
## 2  0.002      0.000  0.002  0.001 0.001  0.000 0.001   0.001     0
##   administr adopt
## 1     0.003 0.001
## 2     0.000 0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might be more helpful to transpose &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; so that, by sorting each topic in decreasing order of the words’ relative weights, we can identify the few most important (heaviest) words for a given topic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;T_phi_matrix &amp;lt;- as.data.frame(t(phi_matrix))
names(T_phi_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2)
T_phi_matrix[1:20, ] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic. 1 Topic. 2
## abil          0.001    0.000
## abus          0.001    0.001
## academ        0.000    0.001
## acceler       0.000    0.000
## accept        0.001    0.000
## access        0.000    0.003
## accomplish    0.000    0.001
## accord        0.000    0.000
## account       0.002    0.001
## achiev        0.001    0.002
## acknowledg    0.001    0.000
## across        0.001    0.002
## action        0.001    0.001
## activ         0.001    0.001
## actual        0.001    0.000
## addit         0.003    0.001
## address       0.005    0.001
## adequ         0.000    0.000
## administr     0.003    0.000
## adopt         0.001    0.000&lt;/code&gt;&lt;/pre&gt;
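&lt;p&gt;The sorting step itself is a one-liner with &lt;code&gt;order()&lt;/code&gt;; for instance, a quick sketch for Topic 1 (using the column names created above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_topic1 &amp;lt;- T_phi_matrix[order(T_phi_matrix$`Topic. 1`, decreasing = TRUE), ]
head(top_topic1, 10) # the 10 heaviest words for Topic 1&lt;/code&gt;&lt;/pre&gt;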
&lt;p&gt;The &lt;code&gt;terms()&lt;/code&gt; function of the &lt;code&gt;topicmodels&lt;/code&gt; package returns a convenient version of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix that replaces the word weights with the words themselves, after sorting each row of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;terms_matrix &amp;lt;- terms(SoSA_topics, 30) # Extract the first 30 most important words for each topic
terms_matrix[1:15, ] # Print out the first 15 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1   Topic 2   
##  [1,] &amp;quot;budget&amp;quot;  &amp;quot;school&amp;quot;  
##  [2,] &amp;quot;fund&amp;quot;    &amp;quot;work&amp;quot;    
##  [3,] &amp;quot;govern&amp;quot;  &amp;quot;educ&amp;quot;    
##  [4,] &amp;quot;peopl&amp;quot;   &amp;quot;help&amp;quot;    
##  [5,] &amp;quot;million&amp;quot; &amp;quot;children&amp;quot;
##  [6,] &amp;quot;work&amp;quot;    &amp;quot;make&amp;quot;    
##  [7,] &amp;quot;make&amp;quot;    &amp;quot;famili&amp;quot;  
##  [8,] &amp;quot;public&amp;quot;  &amp;quot;nation&amp;quot;  
##  [9,] &amp;quot;propos&amp;quot;  &amp;quot;busi&amp;quot;    
## [10,] &amp;quot;servic&amp;quot;  &amp;quot;creat&amp;quot;   
## [11,] &amp;quot;dollar&amp;quot;  &amp;quot;health&amp;quot;  
## [12,] &amp;quot;know&amp;quot;    &amp;quot;student&amp;quot; 
## [13,] &amp;quot;spend&amp;quot;   &amp;quot;invest&amp;quot;  
## [14,] &amp;quot;increas&amp;quot; &amp;quot;teacher&amp;quot; 
## [15,] &amp;quot;program&amp;quot; &amp;quot;care&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Exploring the most important words for each topic, it seems reasonable to infer that Topic.1 is about “money” (the budget), and that Topic.2 is mostly about education.&lt;/p&gt;
&lt;p&gt;In sum, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; provides the essential information needed to understand variations or differences between observations; &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is used to infer the meaning of each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-theta-for-statistical-analysis&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; for statistical analysis&lt;/h1&gt;
&lt;p&gt;What use can we make of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;? Quite a lot!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; alone, or combined with other control variables, &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;, can be used for regular statistical analysis, and it has already been used for economic analyses. &lt;span class=&#34;citation&#34;&gt;(Brown, Crowley, and Elliott 2016)&lt;/span&gt; applied LDA to assess whether the thematic content of financial statement disclosures is informative in predicting intentional misreporting. &lt;span class=&#34;citation&#34;&gt;(Hansen and McMahon 2016)&lt;/span&gt; used LDA in a Factor Augmented Vector Autoregressive modeling framework. I have a working paper exploring the relationship between US governors’ commitment to their economic agenda, as stated in their public statements, and the expansion of business establishments in their states &lt;span class=&#34;citation&#34;&gt;(Bikienga 2017)&lt;/span&gt;. For a survey of the use of LDA and other text analytics tools in economics, see &lt;span class=&#34;citation&#34;&gt;(Gentzkow, Kelly, and Taddy 2017)&lt;/span&gt;.&lt;/p&gt;
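&lt;p&gt;As a purely illustrative sketch of the idea (the outcome variable below is simulated, not real data), &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be fed into an ordinary regression:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
# Hypothetical outcome, e.g. state employment growth (simulated here)
growth &amp;lt;- 1 + 2 * theta_matrix$Topic.2 + rnorm(nrow(theta_matrix), sd = 0.5)
# Topic.1 + Topic.2 = 1 by construction, so include only one of them
fit &amp;lt;- lm(growth ~ Topic.2, data = theta_matrix)
summary(fit)$coefficients&lt;/code&gt;&lt;/pre&gt;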
&lt;/div&gt;
&lt;div id=&#34;illustration-of-the-use-of-theta&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Illustration of the use of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;&lt;/h1&gt;
&lt;p&gt;Is there any difference between Democrats and Republicans in the themes covered in their speeches? To answer this question, we can compute the mean values of the topics by party. Note that D, R, or I is appended to the rownames of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix shown above; they stand for Democrat, Republican, and Independent.&lt;/p&gt;
&lt;p&gt;Here, I am using the rownames to construct additional variables (&lt;code&gt;state&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;, and &lt;code&gt;party&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)
state_vars &amp;lt;- row.names(theta_matrix) %&amp;gt;% 
  str_split(pattern = &amp;quot;_&amp;quot;) %&amp;gt;% as.data.frame() %&amp;gt;% t()
state_vars &amp;lt;- state_vars[, -4]
state_vars &amp;lt;- data.frame(state_vars)
names(state_vars) &amp;lt;- c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;party&amp;quot;)
df &amp;lt;- data.frame(theta_matrix, state_vars)
n_obs &amp;lt;- sample(1:596, size = 10)
sample_obs &amp;lt;- df[n_obs,]
sample_obs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                             Topic.1 Topic.2        state year party
## Idaho_2008_R_126.txt          0.648   0.352        Idaho 2008     R
## NewJersey_2009_D_307.txt      0.477   0.523    NewJersey 2009     D
## NewHampshire_2007_D_295.txt   0.277   0.723 NewHampshire 2007     D
## Alabama_2005_R_5.txt          0.484   0.516      Alabama 2005     R
## Tennessee_2013_R_588.txt      0.669   0.331    Tennessee 2013     R
## Wyoming_2010_D_499.txt        0.795   0.205      Wyoming 2010     D
## Washington_2002_D_460.txt     0.446   0.554   Washington 2002     D
## Maine_2005_D_195.txt          0.344   0.656        Maine 2005     D
## Virginia_2011_R_458.txt       0.570   0.430     Virginia 2011     R
## California_2011_D_52.txt      0.679   0.321   California 2011     D&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, compute the topics’ means by party:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(tidyr)
df_by_party &amp;lt;- df %&amp;gt;%
  group_by(party) %&amp;gt;%
summarise(Topic.1 = mean(Topic.1), Topic.2 = mean(Topic.2)) %&amp;gt;%
  gather(Topic, Topic_proportion, Topic.1:Topic.2) %&amp;gt;%
  mutate(Topic_proportion = round(100*Topic_proportion, 0)) %&amp;gt;%
  mutate(pos = c(rep(75, 3), rep(25, 3)))
df_by_party&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 4
##   party Topic   Topic_proportion   pos
##   &amp;lt;fct&amp;gt; &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 D     Topic.1               46    75
## 2 I     Topic.1               62    75
## 3 R     Topic.1               51    75
## 4 D     Topic.2               54    25
## 5 I     Topic.2               38    25
## 6 R     Topic.2               49    25&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Democrats seem to talk more about education (Topic.2) than Republicans: on average, about 54% of their speeches refer to the education theme, against 49% for Republicans. Conversely, Republicans tend to talk more about budgetary issues than Democrats (51% for Republicans vs. 46% for Democrats).&lt;/p&gt;
&lt;p&gt;Clearly, these differences are not huge, and we should not put too much stock in them. The goal here is to illustrate how one may use the topic distributions, without going into the intricacies of statistical significance.&lt;/p&gt;
&lt;p&gt;The above table can be visualized with the help of a stacked bar plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggthemes)
library(extrafont)
#library(plyr)
#library(scales)
fill &amp;lt;- c(&amp;quot;#add8e6&amp;quot;, &amp;quot;#b87333&amp;quot;)
p_party &amp;lt;- ggplot() +
  geom_bar(aes(y = Topic_proportion, x = party, fill = Topic), 
           data = df_by_party, stat=&amp;quot;identity&amp;quot;) +
  geom_text(data=df_by_party, aes(x = party, y = pos, label = paste0(Topic_proportion,&amp;quot;%&amp;quot;)),
            colour=&amp;quot;black&amp;quot;, family=&amp;quot;Tahoma&amp;quot;, size=4) +
  theme(legend.position=&amp;quot;bottom&amp;quot;, legend.direction=&amp;quot;horizontal&amp;quot;,
        legend.title = element_blank()) +
  labs(x=&amp;quot;Political Party&amp;quot;, y=&amp;quot;Percentage&amp;quot;) +
  ggtitle(&amp;quot;Average Proportion of Topic Covered By Party (%)&amp;quot;) +
  scale_fill_manual(values=fill) +
  theme(axis.line = element_line(size=1, colour = &amp;quot;black&amp;quot;),
        panel.grid.major = element_line(colour = &amp;quot;#d3d3d3&amp;quot;), panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  theme(plot.title = element_text(size = 14, family = &amp;quot;Tahoma&amp;quot;, face = &amp;quot;bold&amp;quot;),
        text=element_text(family=&amp;quot;Tahoma&amp;quot;),
        axis.text.x=element_text(colour=&amp;quot;black&amp;quot;, size = 10),
        axis.text.y=element_text(colour=&amp;quot;black&amp;quot;, size = 10))
p_party&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-11-11-topic-modeling-an-application_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;should-we-trust-the-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Should we trust the results?&lt;/h1&gt;
&lt;p&gt;Yes! We should. A mental block I faced when I started exploring topic modeling was trusting the results. If your program is like mine, latent variable models are not covered in your econometrics classes, even though they are widely used in the economics literature. In Macroeconomics, they are termed Factor Augmented Vector Autoregressive models. In Development Economics, they are used to construct indices &lt;span class=&#34;citation&#34;&gt;(Bérenger and Verdier-Chouchane 2007; Tabellini 2010)&lt;/span&gt;. Factor model approaches are also used as instruments &lt;span class=&#34;citation&#34;&gt;(Bai and Ng 2010)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;But, LDA is just another factor model algorithm. It is closely related to principal component analysis (PCA). In the future, I will present the idea of factor models, and why they are “reliable”.&lt;/p&gt;
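&lt;p&gt;To get a feel for the analogy, one can run PCA on the same word-counts matrix with base R. A rough sketch, assuming &lt;code&gt;SoSA_data_df&lt;/code&gt; is still loaded (and glossing over the scaling choices a careful analysis would weigh):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca &amp;lt;- prcomp(SoSA_data_df) # PCA on the 596 x 1034 counts matrix
scores &amp;lt;- pca$x[, 1:2] # a 596 x 2 reduced matrix, analogous to theta
head(round(scores, 3))&lt;/code&gt;&lt;/pre&gt;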
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In sum, topic modeling in general, and LDA in particular, is a dimension reduction method. It consists of collapsing a matrix of word counts into a reduced matrix of topic distributions. This illustration provides a sense of its usefulness for statistical analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bai2010&#34;&gt;
&lt;p&gt;Bai, Jushan, and Serena Ng. 2010. “Instrumental Variable Estimation in a Data Rich Environment.” &lt;em&gt;Econometric Theory&lt;/em&gt; 26 (6). Cambridge University Press: 1577–1606.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Berenger2007&#34;&gt;
&lt;p&gt;Bérenger, Valérie, and Audrey Verdier-Chouchane. 2007. “Multidimensional Measures of Well-Being: Standard of Living and Quality of Life Across Countries.” &lt;em&gt;World Development&lt;/em&gt; 35 (7). Elsevier: 1259–76.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Bikienga2017&#34;&gt;
&lt;p&gt;Bikienga, Salfo. 2017. “The Governor as the Entrepreneur in Chief: An Exploratory Analysis.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;J. Mach. Learn. Res.&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Brown2016&#34;&gt;
&lt;p&gt;Brown, Nerissa C, Richard M Crowley, and W Brooke Elliott. 2016. “What Are You Saying? Using Topic to Detect Financial Misreporting.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Gentzkow2017&#34;&gt;
&lt;p&gt;Gentzkow, Matthew, Bryan T Kelly, and Matt Taddy. 2017. “Text as Data.” National Bureau of Economic Research.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Hansen2016&#34;&gt;
&lt;p&gt;Hansen, Stephen, and Michael McMahon. 2016. “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” &lt;em&gt;Journal of International Economics&lt;/em&gt; 99. Elsevier: S114–S133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Tabellini2010&#34;&gt;
&lt;p&gt;Tabellini, Guido. 2010. “Culture and Institutions: Economic Development in the Regions of Europe.” &lt;em&gt;Journal of the European Economic Association&lt;/em&gt; 8 (4). Oxford University Press: 677–716.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Introduction to LDA</title>
      <link>/post/introduction-to-lda/</link>
      <pubDate>Wed, 01 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/introduction-to-lda/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;An important development in text analytics is the invention of the Latent Dirichlet Allocation (LDA) algorithm (also called topic modeling) in 2003. LDA is a non-negative matrix factorization algorithm. A matrix factorization consists of decomposing a matrix into a product of two or more matrices. It turns out that these linear algebra techniques have applications in data analysis, generally referred to as data dimension reduction methods. Examples of matrix factorization methods in statistics include Factor Analysis, Principal Component Analysis, and Latent Dirichlet Allocation. They are all latent variable models, which use observed variables to infer the values of unobserved (or hidden) variables. The basic idea of these methods is to find &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}\)&lt;/span&gt; (two sets of hidden variables) from &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V}\)&lt;/span&gt;, the set of observed variables, such that: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}*\phi_{K,V}\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; is the number of observations, &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt; is the number of variables, and &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; is the number of latent variables. We want &lt;span class=&#34;math inline&#34;&gt;\(K&amp;lt;&amp;lt;V\)&lt;/span&gt;, and “hopefully” we can infer a meaning for each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt; from the corresponding &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; rows of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}\)&lt;/span&gt;. It also turns out that most of the information about the observations (the rows of &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V}\)&lt;/span&gt;) is captured in the reduced matrix &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt;, hence the idea of data dimension reduction. A major challenge in data dimension reduction is deciding on the appropriate value of &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;To help fix ideas, let’s assume we have the exam scores of 100 students on the following subjects: Gaelic, English, History, Arithmetic, Algebra, Geometry (this is not a text data example, but it is a good one to illustrate the idea of latent variable models). The dataset is &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V} = W_{100,6}\)&lt;/span&gt;; that is, 100 observations and 6 variables. Let’s assume we want to collapse the &lt;span class=&#34;math inline&#34;&gt;\(V = 6\)&lt;/span&gt; variables into &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt; variables. Let’s further assume that the first new variable may be termed “Humanities”, and the second may be termed “Math” (this is a sensible assumption!). Thus, we want to create a &lt;span class=&#34;math inline&#34;&gt;\(\theta_{100,2}\)&lt;/span&gt; matrix that captures most of the information about the students’ grades on the 6 subjects. With the two variables, humanities and math, we can quickly learn about the students with the help of, for example, a simple scatterplot. The &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix helps us interpret the columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; as humanities and math because (hopefully) one row has big coefficients for Gaelic, English, and History and small coefficients for Arithmetic, Algebra, and Geometry, while the second row has the reverse pattern. I hope this example provides an intuition of what matrix factorization aims to achieve when used for data analysis. The goal is to reduce the dimension of the data, i.e., to reduce the number of variables. The meaning of each new variable is inferred by guessing a name that fits the original variables with the highest coefficients on that new variable. In the future, I will provide a numerical example within the context of Factor Analysis, which is a building block for understanding latent variable models.&lt;/p&gt;
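&lt;p&gt;A tiny simulated version of this example can be run with base R’s &lt;code&gt;factanal()&lt;/code&gt; (the scores below are fabricated for illustration, not real exam data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
n &amp;lt;- 100
hum &amp;lt;- rnorm(n)  # latent &amp;quot;Humanities&amp;quot; ability
math &amp;lt;- rnorm(n) # latent &amp;quot;Math&amp;quot; ability
W &amp;lt;- cbind(Gaelic     = 0.8 * hum  + rnorm(n, sd = 0.4),
           English    = 0.9 * hum  + rnorm(n, sd = 0.4),
           History    = 0.7 * hum  + rnorm(n, sd = 0.4),
           Arithmetic = 0.8 * math + rnorm(n, sd = 0.4),
           Algebra    = 0.9 * math + rnorm(n, sd = 0.4),
           Geometry   = 0.7 * math + rnorm(n, sd = 0.4))
fa &amp;lt;- factanal(W, factors = 2, scores = &amp;quot;regression&amp;quot;)
fa$loadings # the phi-like matrix: weights split cleanly by subject group&lt;/code&gt;&lt;/pre&gt;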
&lt;p&gt;In LDA, the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix is a matrix of word counts, the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix is a matrix of topic proportions within each document, and the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix is a matrix of each word’s relative importance for each topic.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;lda-the-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;LDA: the model&lt;/h1&gt;
&lt;p&gt;This section provides a mathematical exposition of topic modeling and presents the generative process from which the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrices are estimated. LDA is a generative model that represents documents as being generated by a random mixture over latent variables called topics &lt;span class=&#34;citation&#34;&gt;(David M. Blei, Ng, and Jordan 2003)&lt;/span&gt;. A topic is defined as a distribution over words. For a given corpus (a collection of documents) of &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; documents, each of length &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt;, the generative process for LDA is defined as follows (a minimal simulation of this process appears after the list):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;For each topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, draw a distribution over words &lt;span class=&#34;math inline&#34;&gt;\(\phi_k \sim Dirichlet(\beta)\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(k \in \{1, 2, ..., K\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;&lt;p&gt;Draw a vector of topic proportions &lt;span class=&#34;math inline&#34;&gt;\(\theta_d \sim Dirichlet(\alpha)\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each word &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&#34;list-style-type: lower-roman&#34;&gt;
&lt;li&gt;&lt;p&gt;Draw a topic assignment &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n} \sim multinomial(\theta_d)\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n} \in \{1, 2, ..., K\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Draw a word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n} \sim multinomial(\phi_{k = z_{d,n}})\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n} \in \{1, 2, ..., V\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note: Only the words &lt;span class=&#34;math inline&#34;&gt;\(w\)&lt;/span&gt; are observed.&lt;/p&gt;
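&lt;p&gt;A minimal simulation of this generative process may help fix ideas (all sizes and hyperparameters below are made up, and the Dirichlet draws are built from &lt;code&gt;rgamma()&lt;/code&gt; to avoid extra packages):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
rdirichlet &amp;lt;- function(n, alpha) { # each row is one Dirichlet draw
  x &amp;lt;- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  x / rowSums(x)
}
K &amp;lt;- 2; V &amp;lt;- 6; D &amp;lt;- 3; N_d &amp;lt;- 20 # topics, vocabulary, documents, length
phi &amp;lt;- rdirichlet(K, rep(0.5, V))   # step 1: topics (distributions over words)
theta &amp;lt;- rdirichlet(D, rep(0.5, K)) # step 2a: topic proportions per document
docs &amp;lt;- lapply(1:D, function(d) {
  z &amp;lt;- sample(1:K, N_d, replace = TRUE, prob = theta[d, ]) # step 2b-i
  sapply(z, function(k) sample(1:V, 1, prob = phi[k, ]))    # step 2b-ii
})
docs[[1]] # word indices of the first simulated document&lt;/code&gt;&lt;/pre&gt;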
&lt;p&gt;The above generative process allows us to construct an explicit closed form expression for the joint likelihood of the observed and hidden variables. Markov Chain Monte Carlo (MCMC), and Variational Bayes methods can then be used to estimate the parameters &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; (See &lt;span class=&#34;citation&#34;&gt;David M. Blei, Ng, and Jordan (2003)&lt;/span&gt;; &lt;span class=&#34;citation&#34;&gt;David M. Blei (2012)&lt;/span&gt; for further exposition of the method). We derive the posterior distribution of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;s and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;s in the next section.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;deriving-the-theta-and-phi-values&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Deriving the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; values&lt;/h1&gt;
&lt;p&gt;A topic &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}\)&lt;/span&gt; is a distribution over V unique words, each having a proportion &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k,v}\)&lt;/span&gt;; i.e &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k,v}\)&lt;/span&gt; is the relative importance of the word v for the definition (or interpretation) of the topic k. It is assumed that:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\phi_{k}\sim Dirichlet_{V}(\beta)\]&lt;/span&gt; That is: &lt;span class=&#34;math display&#34;&gt;\[p(\phi_{k}|\beta)=\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(B(\beta)=\frac{\prod_{v=1}^{V}\Gamma(\beta_{v})}{\Gamma(\sum_{v=1}^{V}\beta_{v})}\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\beta=(\beta_{1},...,\beta_{V})\)&lt;/span&gt;. Since we have K independent topics (by assumption), &lt;span class=&#34;math display&#34;&gt;\[p(\phi|\beta)=\prod_{k=1}^{K}\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;A document d is a distribution over K topics, each having a proportion &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d,k}\)&lt;/span&gt;, i.e. &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d,k}\)&lt;/span&gt; is the relative importance of the topic k, in the document d. We assume:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\theta_{d}\sim Dirichlet_{K}(\alpha)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(\theta_{d}|\alpha)=\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;And since we have D independent documents (by assumption),&lt;span class=&#34;math display&#34;&gt;\[p(\theta|\alpha)=\prod_{d=1}^{D}\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It is further assumed that &lt;span class=&#34;math inline&#34;&gt;\(\beta_{v}=\beta\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\alpha_{k}=\alpha\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Let &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt; be the latent topic assignment variable, i.e. the random variable &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; assigns the word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; to the topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; is a vector of zeros with a 1 at the &lt;span class=&#34;math inline&#34;&gt;\(k^{th}\)&lt;/span&gt; position &lt;span class=&#34;math inline&#34;&gt;\((z_{d,n}=[0,0,...1,0,..])\)&lt;/span&gt;. Define &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n,k}=I(z_{d,n}=k)\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; is an indicator function that equals 1 when &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; is the topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; otherwise. We assume:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[z_{d,n}\sim Multinomial(\theta_{d})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is: &lt;span class=&#34;math display&#34;&gt;\[p(z_{d,n,k}=1|\theta_{d})=\theta_{d,k}, \qquad p(z_{d,n}|\theta_{d})=\prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;A document is assumed to have &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt; independent words, and since we assume D independent documents, we have:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(z|\theta)   =\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{n=1}^{N_{d}}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{v=1}^{V}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt; is the count of the word v in document d.&lt;/p&gt;
&lt;p&gt;The word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is drawn from the topic’s words distribution &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[w_{d,n}|\phi_{k=z_{d,n,k}}\sim Multinomial(\phi_{k=z_{d,n}})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(w_{d,n}=v|\phi_{k=z_{d,n}})=\phi_{k,v}, \qquad p(w_{d,n}|\phi,z_{d,n})=\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is a vector of zeros with a &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; at the &lt;span class=&#34;math inline&#34;&gt;\(v^{th}\)&lt;/span&gt; position. Define &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n,v}=I(w_{d,n}=v)\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; is an indicator function that equals &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; when &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; otherwise.&lt;/p&gt;
&lt;p&gt;There are D independent documents, each having &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt; independent words, so: &lt;span class=&#34;math display&#34;&gt;\[p(w|\phi)=\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(w|\phi)=\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The joint distribution of the observed words w and unobserved (or hidden variables) &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[P(w,z,\theta,\phi|\alpha,\beta)=p(\theta|\alpha)p(z|\theta)p(w|\phi,z)p(\phi|\beta)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The goal is to get the posterior distribution of the unobserved variables: &lt;span class=&#34;math display&#34;&gt;\[p(z,\theta,\phi|w,\alpha,\beta)=\frac{P(w,z,\theta,\phi|\alpha,\beta)}{\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi\)&lt;/span&gt; is intractable, so approximation methods are used to approximate the posterior distribution. The seminal LDA paper &lt;span class=&#34;citation&#34;&gt;(David M. Blei, Ng, and Jordan 2003)&lt;/span&gt; uses Mean Field Variational Bayes (an optimization method) to approximate the posterior distribution (see &lt;span class=&#34;citation&#34;&gt;Bishop (2006)&lt;/span&gt;, p. 462, or &lt;span class=&#34;citation&#34;&gt;David M Blei, Kucukelbir, and McAuliffe (2017)&lt;/span&gt; for an exposition of the theory of the variational method). The mean field variational inference uses the following approximation: &lt;span class=&#34;math display&#34;&gt;\[p(z,\theta,\phi|w,\alpha,\beta)\simeq q(z,\theta,\phi)=q(z)q(\theta)q(\phi)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;From &lt;span class=&#34;citation&#34;&gt;Bishop (2006)&lt;/span&gt;, [p. 466], we have: &lt;span class=&#34;math display&#34;&gt;\[q^{*}(z)\propto exp\left\{ E_{\theta,\phi}\left[log(p(z|\theta))+log(p(w|\phi,z))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\theta)\propto exp\left\{ E_{z,\phi}\left[log(p(\theta|\alpha))+log(p(z|\theta))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\phi)\propto exp\left\{ E_{\theta,z}\left[log(p(\phi|\beta))+log(p(w|\phi,z))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Using the expressions above, we have:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[log(q^{*}(z)) \propto E_{\theta,\phi}\left[\sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(log(\theta_{d,k})+log(\phi_{k,v})\right)\right]\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[\propto   \sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(E(log(\theta_{d,k}))+E(log(\phi_{k,v}))\right)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Note that &lt;span class=&#34;math display&#34;&gt;\[x|p\sim Multinomial_{K}(p)\iff log\left(p(x|p)\right)=\sum_{k=1}^{K}x_{k}log(p_{k})\]&lt;/span&gt;, and let’s define &lt;span class=&#34;math inline&#34;&gt;\(log(p_{k})=E(log(\theta_{d,k})+E(log(\phi_{k,v}))\)&lt;/span&gt;, so &lt;span class=&#34;math inline&#34;&gt;\(p_{k}=exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\)&lt;/span&gt;. Thus,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(z)\propto\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\left[exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\right]^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is, &lt;span class=&#34;math display&#34;&gt;\[z_{d,v}|w_{d},\theta_{d},\phi_{k}\sim Multinomial_{K}(p_{k})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;and by the multinomial properties, &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})=p_{k}=exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\theta) \propto exp\left\{ E_{z}\left[\sum_{d}\sum_{k}(\alpha-1)log(\theta_{d,k})+\sum_{d}\sum_{k}\sum_{v}n_{d,v}*z_{d,v,k}log(\theta_{d,k})\right]\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d}^{D}\prod_{k=1}^{K}exp\left\{ (\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1)log(\theta_{d,k})\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\theta_{d,k}^{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thus, the approximate posterior distribution of the topic distribution in a document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt; is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\theta_{d}|w_{d},\alpha\sim Dirichlet_{K}(\tilde{\alpha}_{d})\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\alpha}_{d}=\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})\)&lt;/span&gt;. Note that &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\alpha}_{d}\)&lt;/span&gt; is a vector of dimension &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;By the properties of the Dirichlet distribution, the expected value of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d}|\tilde{\alpha}_{d}\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[E(\theta_{d}|\tilde{\alpha_{d}})=\frac{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})}{\sum_{k=1}^{K}[\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})]}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The numerical estimation of &lt;span class=&#34;math inline&#34;&gt;\(E(\theta_{d}|\tilde{\alpha}_{d})\)&lt;/span&gt; gives the estimates of the topic proportions within each document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\((\hat\theta_{d})\)&lt;/span&gt;. It is worth noting that &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})\)&lt;/span&gt; can be interpreted as the responsibility that topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; takes for explaining the occurrence of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;. Ignoring for a moment the denominator of the equation above, &lt;span class=&#34;math inline&#34;&gt;\(E(\theta_{d,k}|\tilde{\alpha}_{d,k})\)&lt;/span&gt; is similar to a regression equation where &lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt; are the observed word counts in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})\)&lt;/span&gt; are the parameter estimates (or weights) of the words. This illustrates that a topic is important in a document when words referring to that topic are frequent &lt;span class=&#34;math inline&#34;&gt;\((n_{d,v})\)&lt;/span&gt; and carry large weights &lt;span class=&#34;math inline&#34;&gt;\((E(z_{d,v,k}))\)&lt;/span&gt;.&lt;/p&gt;
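&lt;p&gt;A toy numerical version of this update may make the formula concrete (the counts and responsibilities below are invented for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One document d with V = 3 words and K = 2 topics
alpha &amp;lt;- 0.1
n_dv &amp;lt;- c(4, 2, 1) # counts of the 3 words in document d
Ez &amp;lt;- matrix(c(0.9, 0.1,   # E(z_{d,v,k}): row v, column k
               0.2, 0.8,
               0.5, 0.5), nrow = 3, byrow = TRUE)
alpha_tilde &amp;lt;- alpha + colSums(n_dv * Ez)   # alpha-tilde_d, length K
theta_hat &amp;lt;- alpha_tilde / sum(alpha_tilde) # E(theta_d | alpha-tilde_d)
theta_hat&lt;/code&gt;&lt;/pre&gt;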
&lt;p&gt;The derivation for &lt;span class=&#34;math inline&#34;&gt;\(q^{*}(\phi)\)&lt;/span&gt; proceeds similarly:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\phi)   \propto exp\left\{ E_{z}\left[\sum_{k=1}^{K}\sum_{v=1}^{V}(\beta-1)log(\phi_{k,v})+\sum_{d=1}^{D}\sum_{k=1}^{K}\sum_{v=1}^{V}n_{d,v}*z_{d,v,k}log(\phi_{k,v})\right]\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{k=1}^{K}\prod_{v=1}^{V}exp\left\{ (\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1)log(\phi_{k,v})\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{k=1}^{K}\prod_{v=1}^{V}\phi_{k,v}^{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thus, the approximate posterior distribution of the word distribution in a topic, &lt;span class=&#34;math inline&#34;&gt;\(\hat\phi_{k}\)&lt;/span&gt;, is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}|w,\beta\sim Dirichlet_{V}(\tilde{\beta_{k}})\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\beta_{k}}=\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,.,k})\)&lt;/span&gt;. Note that &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\beta}_{k}\)&lt;/span&gt; is a vector of dimension &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;And the expected value of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}|\tilde{\beta}_{k}\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[  
E(\phi_{k}|\tilde{\beta_{k}})=\frac{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,.,k})}{\sum_{v=1}^{V}(\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k}))} 
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The numerical estimation of &lt;span class=&#34;math inline&#34;&gt;\(E(\phi_{k}|\tilde{\beta}_{k})\)&lt;/span&gt; gives the estimates of the words’ relative importance for each topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\((\hat\phi_{k})\)&lt;/span&gt;. Ignoring the denominator in the equation above, &lt;span class=&#34;math inline&#34;&gt;\(E(\phi_{k,v}|\tilde{\beta_{k,v}})\)&lt;/span&gt; is the weighted sum of the frequencies of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in each of the documents &lt;span class=&#34;math inline&#34;&gt;\((n_{d,v})\)&lt;/span&gt;, the weights being the responsibility topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; takes for explaining the occurrence of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt; &lt;span class=&#34;math inline&#34;&gt;\((E(z_{d,v,k}))\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Here, we have derived the posteriors expected values of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;s and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;s using the words counts &lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt;, which is slightly different from &lt;span class=&#34;citation&#34;&gt;David M. Blei, Ng, and Jordan (2003)&lt;/span&gt;. Posterior formulae similar to the current derived solution can be found in &lt;span class=&#34;citation&#34;&gt;Murphy (2012)&lt;/span&gt;, p. 962.&lt;/p&gt;
&lt;p&gt;In sum, the rows of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}=\left[E(\phi_{k}|\tilde{\beta}_{k})\right]_{K,V}\)&lt;/span&gt; are useful for interpreting (or identifying) the themes, whose relative importance in each document is given by the columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}=\left[E(\theta_{d}|\tilde{\alpha}_{d})\right]_{D,K}\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Practical tools for estimating the topic distributions of a corpus exist (see &lt;span class=&#34;citation&#34;&gt;Grun and Hornik (2011)&lt;/span&gt;; &lt;span class=&#34;citation&#34;&gt;Silge and Robinson (2017, Chap. 6)&lt;/span&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bishop2006&#34;&gt;
&lt;p&gt;Bishop, Christopher M. 2006. &lt;em&gt;Pattern Recognition and Machine Learning&lt;/em&gt;. springer.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2017&#34;&gt;
&lt;p&gt;Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” &lt;em&gt;Journal of the American Statistical Association&lt;/em&gt;, no. just-accepted. Taylor &amp;amp; Francis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2012&#34;&gt;
&lt;p&gt;Blei, David M. 2012. “Probabilistic Topic Models.” &lt;em&gt;Commun. ACM&lt;/em&gt; 55 (4). New York, NY, USA: ACM: 77–84. doi:&lt;a href=&#34;https://doi.org/10.1145/2133806.2133826&#34;&gt;10.1145/2133806.2133826&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;J. Mach. Learn. Res.&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Grun2011&#34;&gt;
&lt;p&gt;Grun, Bettina, and Kurt Hornik. 2011. “Topicmodels: An R Package for Fitting Topic Models.” &lt;em&gt;Journal of Statistical Software, Articles&lt;/em&gt; 40 (13): 1–30. doi:&lt;a href=&#34;https://doi.org/10.18637/jss.v040.i13&#34;&gt;10.18637/jss.v040.i13&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Murphy2012&#34;&gt;
&lt;p&gt;Murphy, Kevin P. 2012. &lt;em&gt;Machine Learning: A Probabilistic Perspective&lt;/em&gt;. MIT press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Silge2017&#34;&gt;
&lt;p&gt;Silge, J., and D. Robinson. 2017. &lt;em&gt;Text Mining with R: A Tidy Approach&lt;/em&gt;. O’Reilly Media, Incorporated. &lt;a href=&#34;https://books.google.com/books?id=7bQzMQAACAAJ&#34; class=&#34;uri&#34;&gt;https://books.google.com/books?id=7bQzMQAACAAJ&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
