<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Matrix Factorization on Salfo Bikienga</title>
    <link>/tags/matrix-factorization/</link>
    <description>Recent content in Matrix Factorization on Salfo Bikienga</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2017 Salfo Bikienga</copyright>
    <lastBuildDate>Fri, 17 Nov 2017 00:00:00 +0000</lastBuildDate>
    <atom:link href="/tags/matrix-factorization/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Topic modeling: The Intuition</title>
      <link>/post/topic-modeling-the-intuition/</link>
      <pubDate>Fri, 17 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-the-intuition/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Whenever I give a talk on topic modeling to people not familiar with the subject, the usual question I receive is: “can you provide some intuition behind topic modeling?” Another variant of the same question is: “This is magic. How can the computer identify the topics in the documents?” No! It is not magic. It is math. I presented the math behind &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;Latent Dirichlet Allocation&lt;/a&gt;, and an &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/topic-modeling-an-application/&#34; target=&#34;_blank&#34;&gt;example application&lt;/a&gt;, in previous posts. Here is my attempt at providing the intuition from the perspective of someone with a basic understanding of simple linear regression and a bit of matrix algebra.&lt;br /&gt;
Topic modeling is a form of matrix factorization. Though modern topic modeling algorithms involve complex probability theory, the basic intuition can be developed through simple matrix factorization.&lt;br /&gt;
Matrix factorization can be understood as a dimension reduction method. In a world of “big data”, the usefulness of such a method is immense. For instance, linear regression, the most used statistical tool in economics, is only applicable when &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is at least as big as &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt;, the number of variables. When &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; is too big, we resort to dimension reduction methods such as choosing a few variables based on theory, or using &lt;a href=&#34;https://en.wikipedia.org/wiki/Stepwise_regression&#34; target=&#34;_blank&#34;&gt;stepwise&lt;/a&gt; or &lt;a href=&#34;https://en.wikipedia.org/wiki/Lasso_(statistics)&#34; target=&#34;_blank&#34;&gt;LASSO&lt;/a&gt; regression. With matrix factorization, we do not have to select variables. We can just “redefine” the variables in a lower dimensional space, that is, convert the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; dimensional data into &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; dimensional data, where &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; is significantly less than &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(k \ll p\)&lt;/span&gt;). The question is, how does that make sense? It is just matrix algebra, as you will see very soon.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-idea-of-dimension-reduction-from-matrix-factorization-perspective&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The idea of dimension reduction from matrix factorization perspective&lt;/h1&gt;
&lt;p&gt;1- Consider measures of &lt;code&gt;length&lt;/code&gt;, &lt;code&gt;width&lt;/code&gt;, and &lt;code&gt;depth&lt;/code&gt;. These are three variables, i.e. three dimensional data. If size is enough information we care about, then &lt;code&gt;volume&lt;/code&gt;, that is, &lt;code&gt;Volume = length x width x depth&lt;/code&gt; is a good variable. Thus, we can collapse the three variables (&lt;code&gt;length&lt;/code&gt;, &lt;code&gt;width&lt;/code&gt;, &lt;code&gt;depth&lt;/code&gt;) into a single variable, &lt;code&gt;volume&lt;/code&gt; and preserve the essential information needed.&lt;/p&gt;
&lt;p&gt;2- Consider measures of &lt;code&gt;height&lt;/code&gt;, &lt;code&gt;weight&lt;/code&gt;, and &lt;code&gt;waist&lt;/code&gt;. These are three variables, i.e. three dimensional data. If size provides enough information for what we need, then some form of linear combination of the three variables (&lt;span class=&#34;math inline&#34;&gt;\(size = b_1\times height + b_2 \times weight + b_3 \times waist\)&lt;/span&gt;) will do. Thus, we collapse the three dimensional data into one dimensional data.&lt;/p&gt;
&lt;p&gt;3- Consider a dataset of word counts in several documents. Let’s consider the following words: &lt;code&gt;college&lt;/code&gt;, &lt;code&gt;drugs&lt;/code&gt;, &lt;code&gt;education&lt;/code&gt;, &lt;code&gt;graduation&lt;/code&gt;, &lt;code&gt;health&lt;/code&gt;, &lt;code&gt;medicaid&lt;/code&gt;. This is six dimensional data. If what we care about are the concepts of education and health care, then some form of linear combination of these word counts will do. Our task consists of finding the appropriate weights so that a document having higher counts of education related words than other words gets a high value for the education concept, and a low value for the health concept. And a document having higher counts of health related words than other words gets a high value for the health concept, and a low value for the education concept. Thus, we reduce the six dimensional data into two dimensional data, while preserving the essential information we care about.&lt;/p&gt;
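&lt;p&gt;As a minimal sketch of the third example, the code below computes two concept scores as weighted sums of word counts. The counts and the 0/1 weights are made up for illustration; they are assumptions, not estimates from any data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical word counts: 2 documents (rows) by 6 words (columns)
counts &amp;lt;- matrix(c(5, 0, 4, 3, 1, 0,
                   0, 4, 1, 0, 6, 5), byrow = TRUE, nrow = 2)
colnames(counts) &amp;lt;- c(&amp;quot;college&amp;quot;, &amp;quot;drugs&amp;quot;, &amp;quot;education&amp;quot;,
                      &amp;quot;graduation&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)

# Hypothetical weights: one column per concept, one row per word
weights &amp;lt;- cbind(education = c(1, 0, 1, 1, 0, 0),
                 health    = c(0, 1, 0, 0, 1, 1))

# Each concept score is a linear combination of a document&#39;s word counts
scores &amp;lt;- counts %*% weights
scores&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these made-up numbers, the first document scores 12 on education and 1 on health, while the second scores 1 on education and 15 on health. In practice the weights are not chosen by hand; they are estimated, which is what the rest of this post is about.&lt;/p&gt;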
&lt;/div&gt;
&lt;div id=&#34;the-idea-of-matrix-factorization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The idea of matrix factorization&lt;/h1&gt;
The idea of matrix factorization stems from the fact that any matrix can be decomposed into the product of two or more matrices. Let &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p}\)&lt;/span&gt; be a data matrix with &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; rows and &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; columns. We can write the same matrix, at least approximately, as the product of two matrices:
&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
W_{n,p} \simeq Z_{n,k}B_{k,p}
\label{eq:fac1}
\end{equation}\]&lt;/span&gt;
&lt;p&gt;It turns out that &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; preserves the essential information needed to understand variations between the &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; observations in &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;.&lt;/p&gt;
&lt;div id=&#34;illustrative-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Illustrative example&lt;/h2&gt;
&lt;p&gt;Let &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p}\)&lt;/span&gt; be a spreadsheet of word counts in &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; documents. &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; is the number of unique words, which can be seen as the variables. Here, &lt;span class=&#34;math inline&#34;&gt;\(n=6\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(p = 5\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(k = 2\)&lt;/span&gt;. Let &lt;code&gt;college&lt;/code&gt;, &lt;code&gt;education&lt;/code&gt;, &lt;code&gt;family&lt;/code&gt;, &lt;code&gt;health&lt;/code&gt;, and &lt;code&gt;medicaid&lt;/code&gt; be, respectively, the variable names of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[ 
\underbrace{\begin{bmatrix}
4&amp;amp;6&amp;amp;0&amp;amp;2&amp;amp;2  \\ 
0&amp;amp;0&amp;amp;4&amp;amp;8&amp;amp;12  \\ 
6&amp;amp;9&amp;amp;1&amp;amp;5&amp;amp;6 \\    
2&amp;amp;3&amp;amp;3&amp;amp;7&amp;amp;10 \\
0&amp;amp;0&amp;amp;3&amp;amp;6&amp;amp;9 \\
4&amp;amp;6&amp;amp;1&amp;amp;4&amp;amp;5 \\
    \end{bmatrix}
        }_{\mathbf{W_{6,5}}}
=
\underbrace{\begin{bmatrix} 
2&amp;amp;0  \\
0&amp;amp;4  \\
3&amp;amp;1 \\
1&amp;amp;3 \\
0&amp;amp;3 \\
2&amp;amp;1 \\
    \end{bmatrix}
        }_{\mathbf{Z_{6,2}}} 
\underbrace{\begin{bmatrix} 
2&amp;amp;3&amp;amp;0&amp;amp;1&amp;amp;1 \\
0&amp;amp;0&amp;amp;1&amp;amp;2&amp;amp;3 \\             
    \end{bmatrix}
        }_{\mathbf{B_{2,5}}} 
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It is easy to check that the product holds, that is, &lt;span class=&#34;math inline&#34;&gt;\(Z_{6,2}B_{2,5} = W_{6,5}\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; contains most of the information about the observations contained in &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;. With &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, we can explore, or study the variation between the observations, easily. &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; is a two dimensional representation of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;, and a simple scatterplot can be used to explore the data, as shown in the plot below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- matrix(c(2, 0,
             0, 4,
             3, 1,
             1, 3,
             0, 3,
             2, 1), byrow = TRUE, nrow = 6)
Z &amp;lt;- data.frame(z1 = Z[,1], z2 = Z[, 2])
plot(x = Z$z1, y = Z$z2, cex = 3)
text(x = Z$z1, y = Z$z2, labels= 1:6, cex= 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-12-17-topic-modeling-the-intuition_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the plot, we can deduce that observations (or documents) 2, 4 and 5 are close to each other; observations 1, 6 and 3 are also close to each other. The point here is that with a reduced dimension, it is easier to draw some insight from the data. Hence, the benefit of matrix factorization for data analysis.&lt;/p&gt;
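&lt;p&gt;The factorization displayed earlier can also be checked numerically. The short sketch below simply re-enters the three matrices of the illustrative example and compares the product with &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;; the object names &lt;code&gt;Z_ex&lt;/code&gt;, &lt;code&gt;B_ex&lt;/code&gt; and &lt;code&gt;W_ex&lt;/code&gt; are arbitrary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Matrices of the illustrative example, entered row by row
Z_ex &amp;lt;- matrix(c(2, 0,
                 0, 4,
                 3, 1,
                 1, 3,
                 0, 3,
                 2, 1), byrow = TRUE, nrow = 6)
B_ex &amp;lt;- matrix(c(2, 3, 0, 1, 1,
                 0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
W_ex &amp;lt;- matrix(c(4, 6, 0, 2,  2,
                 0, 0, 4, 8, 12,
                 6, 9, 1, 5,  6,
                 2, 3, 3, 7, 10,
                 0, 0, 3, 6,  9,
                 4, 6, 1, 4,  5), byrow = TRUE, nrow = 6)

# Matrix product of Z (6 x 2) and B (2 x 5) compared element by element with W
product_holds &amp;lt;- all(Z_ex %*% B_ex == W_ex)
product_holds&lt;/code&gt;&lt;/pre&gt;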
&lt;p&gt;For a &lt;strong&gt;predictive modeling&lt;/strong&gt; exercise, we replace the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix with the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; matrix, and the usual tools (linear regression, logistic regression, regression tree, etc.) can be used. We do not have to understand the meaning of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. We only care about their ability to predict.&lt;br /&gt;
However, for &lt;strong&gt;exploratory&lt;/strong&gt; and &lt;strong&gt;inferential&lt;/strong&gt; data analysis, we want to understand the meaning of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. To tell a story, we have to know the meaning of the variables. We infer the meaning of the new variables by inspecting the &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; matrix. I will explain why that is the case shortly. For now, note that the number of columns of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is the number of columns of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix; and the number of its rows is the number of columns of the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; matrix. Each row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is used to interpret the meaning of each column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. Row &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is used to infer the meaning of column &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. For instance, referring to our illustrative example above, row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has its biggest values at its first and second column, that is variables 1 and 2 (&lt;code&gt;college&lt;/code&gt; and &lt;code&gt;education&lt;/code&gt;) of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix are dominant in the identification of the meaning of the first &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; variable. 
Likewise, row 2 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has its biggest values at its two last columns; the variables 4 and 5 (&lt;code&gt;health&lt;/code&gt; and &lt;code&gt;medicaid&lt;/code&gt;) of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix are dominant in the identification of the meaning of the second &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; variable. Thus, the columns of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; represent, respectively, measures of education and health concepts.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;finding-z-and-b&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finding &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;There are several matrix factorization algorithms: Factor Analysis (FA), Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF), Probabilistic Latent Semantic Analysis (PLSA) and its variants, etc. Since our goal for this introduction is to present the basic idea, let’s present an algorithm that is closer to something we are all familiar with: Ordinary Least Squares (OLS).&lt;/p&gt;
&lt;div id=&#34;multivariate-ols&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Multivariate OLS&lt;/h3&gt;
&lt;p&gt;From introductory statistics, we know that for: &lt;span class=&#34;math display&#34;&gt;\[y_{n,1} = X_{n,p}\beta_{p,1} + \epsilon_{n,1}\]&lt;/span&gt; the least squares solution for &lt;span class=&#34;math inline&#34;&gt;\(\beta_{p,1}\)&lt;/span&gt; is: &lt;span class=&#34;math display&#34;&gt;\[\hat\beta_{p,1} = (X^tX)^{-1}X^ty\]&lt;/span&gt; We are assuming that &lt;span class=&#34;math inline&#34;&gt;\((X^tX)^{-1}\)&lt;/span&gt; exists. &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; stands for transpose, and &lt;span class=&#34;math inline&#34;&gt;\(-1\)&lt;/span&gt; stands for inverse.&lt;/p&gt;
&lt;p&gt;In case you do not remember this formula, recall that: &lt;span class=&#34;math display&#34;&gt;\[y_{n,1} = X_{n,p}\beta_{p,1} + \epsilon_{n,1}
\Rightarrow 
X^ty = (X^tX)\beta + X^t\epsilon\]&lt;/span&gt; Under the assumptions of no correlation between &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\epsilon\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(E(\epsilon) = 0\)&lt;/span&gt;, we can set &lt;span class=&#34;math inline&#34;&gt;\(X^t\epsilon=0\)&lt;/span&gt;. So we have: &lt;span class=&#34;math display&#34;&gt;\[X^ty = (X^tX)\beta \\
\Leftrightarrow \\
(X^tX)^{-1}X^ty = (X^tX)^{-1}(X^tX)\beta \\
\Rightarrow \\
\hat\beta = (X^tX)^{-1}X^ty
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;With more than one left hand side variable, that is, when the vector &lt;span class=&#34;math inline&#34;&gt;\(y_{n,1}\)&lt;/span&gt; is replaced by a matrix &lt;span class=&#34;math inline&#34;&gt;\(Y_{n,q}\)&lt;/span&gt;, the same formula applies, and we have: &lt;span class=&#34;math display&#34;&gt;\[\hat B = (X^tX)^{-1}X^tY\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is a &lt;span class=&#34;math inline&#34;&gt;\(p\times q\)&lt;/span&gt; matrix, and &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; is an &lt;span class=&#34;math inline&#34;&gt;\(n \times q\)&lt;/span&gt; matrix.&lt;/p&gt;
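&lt;p&gt;As a quick sanity check of the multivariate formula, the sketch below simulates a small data set and compares the matrix expression with the coefficients returned by &lt;code&gt;lm()&lt;/code&gt; fitted without an intercept. The dimensions, the seed, and the matrix &lt;code&gt;B_true&lt;/code&gt; are arbitrary assumptions chosen only for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
n &amp;lt;- 50; p &amp;lt;- 3; q &amp;lt;- 2

# Simulated regressors, made-up coefficients, and two left hand side variables
X &amp;lt;- matrix(rnorm(n * p), nrow = n)
B_true &amp;lt;- matrix(c(1, -2, 0.5,
                   0,  3, 1), nrow = p)
Y &amp;lt;- X %*% B_true + matrix(rnorm(n * q, sd = 0.1), nrow = n)

# Least squares solution from the formula (X&#39;X)^{-1} X&#39;Y
B_hat &amp;lt;- solve(t(X) %*% X) %*% t(X) %*% Y

# Same regression through lm(), without an intercept
B_lm &amp;lt;- coef(lm(Y ~ X - 1))

# Both are p x q matrices and agree up to numerical tolerance
same_fit &amp;lt;- all.equal(unname(B_hat), unname(B_lm))
same_fit&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each column of &lt;code&gt;B_hat&lt;/code&gt; is the ordinary least squares solution for the corresponding column of &lt;code&gt;Y&lt;/code&gt;, so the multivariate case is just several regressions sharing the same &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;.&lt;/p&gt;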
&lt;/div&gt;
&lt;div id=&#34;multivariate-ols-and-matrix-factorization&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Multivariate OLS and matrix factorization&lt;/h3&gt;
What does multivariate regression have to do with matrix factorization? Note that, ignoring the &lt;span class=&#34;math inline&#34;&gt;\(\epsilon\)&lt;/span&gt;, we could have written:
&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
Y_{n,q} \simeq X_{n,p}B_{p,q}
\label{eq:ols}
\end{equation}\]&lt;/span&gt;
&lt;p&gt;This equation is very similar to the equation &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p} \simeq Z_{n,k}B_{k,p}\)&lt;/span&gt;, except &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; is observed for the case of the multivariate OLS.&lt;/p&gt;
&lt;p&gt;In multivariate OLS, we only estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. For matrix factorization, we estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;.&lt;br /&gt;
From &lt;span class=&#34;math inline&#34;&gt;\(W \simeq ZB\)&lt;/span&gt;, we can solve for &lt;span class=&#34;math display&#34;&gt;\[\hat B = (Z^tZ)^{-1}Z^tW\]&lt;/span&gt; or &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WB^t(BB^t)^{-1}\]&lt;/span&gt; The predicted values for &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; are: &lt;span class=&#34;math display&#34;&gt;\[\hat W = \hat Z \hat B\]&lt;/span&gt; To estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, we need &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, and to estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; we need &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. We do not have either one. The trick is to guess some initial values for &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, and use them to estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, then use the estimated &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to estimate a new &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. Use the new &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; to estimate a new &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. Continue the iteration until some stopping criterion is met. Thus, we estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; iteratively (this estimation method is known as Alternating Least Squares). When do we stop the iteration?&lt;/p&gt;
&lt;p&gt;Again, &lt;span class=&#34;math inline&#34;&gt;\(\hat W = \hat Z \hat B\)&lt;/span&gt; gives the predicted values for &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;. We iterate until the distance between &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; and its predicted value, &lt;span class=&#34;math inline&#34;&gt;\(\hat W\)&lt;/span&gt;, is negligible. There are several distance measures, but let’s keep things simple by using the Euclidean distance, or &lt;span class=&#34;math inline&#34;&gt;\(L_2\)&lt;/span&gt; norm, taken over all the elements of the matrix (also known as the Frobenius norm): &lt;span class=&#34;math display&#34;&gt;\[Q(\hat Z, \hat B) = ||W-\hat W (\hat Z, \hat B)||_2 = \sqrt{\sum_{i = 1}^n \sum_{j = 1}^p (w_{i,j} - \hat w_{i,j})^2}\]&lt;/span&gt; Thus, we minimize &lt;span class=&#34;math inline&#34;&gt;\(Q\)&lt;/span&gt;, the objective function. Following is an example implementation of a simple alternating least squares algorithm.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;W &amp;lt;- matrix(c(4,    6,    0,    2,    2,
             0,    0,    4,    8,   12,
             6,    9,    1,    5,    6,
             2,    3,    3,    7,   10,
             0,    0,    3,    6,    9,
             2,    6,    1,    4,    5), byrow = TRUE, nrow = 6)  # row 6 starts with 2 here (4 in the illustrative example)

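set.seed(3)
# Random non-negative integer starting values for Z (6 documents x 2 topics)
Z_init &amp;lt;- abs(round(rnorm(n = 6*2, mean = 0, sd = 2),0))
Z_init &amp;lt;- matrix(Z_init, nrow = 6)

Z &amp;lt;- Z_init
dist_ww &amp;lt;- 1e3      # distance between W and W_hat, initialized to a large value
max_iter &amp;lt;- 1000
iter &amp;lt;- 0
while(iter &amp;lt;= max_iter &amp;amp;&amp;amp; dist_ww &amp;gt;= 1e-6) {
  iter &amp;lt;- iter + 1
  ZZ_inv &amp;lt;- solve(t(Z)%*%Z)
  B &amp;lt;- ZZ_inv%*%t(Z)%*%W        # update B given the current Z
  BB_inv &amp;lt;- solve(B%*%t(B))
  Z &amp;lt;- W%*%t(B)%*%BB_inv        # update Z given the new B
  W_hat &amp;lt;- Z%*%B                # predicted values of W
  # Euclidean distance: square each element of the difference, then sum
  dist_ww &amp;lt;- sqrt(sum((W-W_hat)^2))
}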
W &amp;lt;- data.frame(W)
names(W) &amp;lt;- c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)
Z &amp;lt;- data.frame(round(Z, 2))
row.names(Z) &amp;lt;- paste0(&amp;quot;document.&amp;quot;, 1:6)
names(Z) &amp;lt;- c(&amp;quot;Topic.1&amp;quot;, &amp;quot;Topic.2&amp;quot;)
B &amp;lt;- data.frame(round(B, 2), row.names = c(&amp;quot;Topic.1&amp;quot;, &amp;quot;Topic.2&amp;quot;))
names(B) &amp;lt;- c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Below is the table of the least squares estimate of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         college education family health medicaid
## Topic.1    1.18      1.96  -0.02    0.6     0.58
## Topic.2    0.50      0.85   1.11    2.5     3.60&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Observe that row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has high values in columns 1 and 2 compared to columns 3, 4, and 5; and row 2 has higher values for columns 4 and 5 compared to columns 1, 2, and 3. It is reasonable to infer that row 1 (&lt;code&gt;Topic.1&lt;/code&gt;) refers to education, and row 2 (&lt;code&gt;Topic.2&lt;/code&gt;) refers to health.&lt;/p&gt;
&lt;p&gt;Below is the table of the least squares estimate of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic.1 Topic.2
## document.1    3.13    0.05
## document.2   -1.55    3.58
## document.3    4.31    0.97
## document.4    0.41    2.71
## document.5   -1.16    2.68
## document.6    2.26    1.03&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Observe that &lt;code&gt;Topic.1&lt;/code&gt; has big values in documents 1, 3, and 6. Likewise, &lt;code&gt;Topic.2&lt;/code&gt; has big values in documents 2, 4, and 5. Hence, we can infer that documents 1, 3, and 6 are mostly about education; and documents 2, 4, and 5 are mostly about health.&lt;/p&gt;
&lt;p&gt;We can use a scatterplot to explore the original five dimensional &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; data through the two dimensional &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; data as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(x = Z$Topic.1, y = Z$Topic.2, cex = 3, 
     xlab = &amp;quot;Topic.1&amp;quot;, ylab = &amp;quot;Topic.2&amp;quot;)
text(x = Z$Topic.1, y = Z$Topic.2, labels= 1:6, cex= 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-12-17-topic-modeling-the-intuition_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
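&lt;p&gt;The alternating least squares loop can also be wrapped in a reusable function. The sketch below is one possible variant, not the exact code used above: the function name &lt;code&gt;als_factorize&lt;/code&gt;, its arguments, and the stopping rule (stop when the objective &lt;span class=&#34;math inline&#34;&gt;\(Q\)&lt;/span&gt; changes by less than &lt;code&gt;tol&lt;/code&gt; between two iterations) are assumptions made for illustration. Such a rule is convenient when &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; cannot be reproduced exactly, because &lt;span class=&#34;math inline&#34;&gt;\(Q\)&lt;/span&gt; then never reaches zero.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Alternating least squares for W ~ Z B with k columns in Z (illustrative sketch)
als_factorize &amp;lt;- function(W, k, tol = 1e-8, max_iter = 1000, seed = 3) {
  set.seed(seed)
  Z &amp;lt;- matrix(rnorm(nrow(W) * k), nrow = nrow(W))  # random starting values
  dist_old &amp;lt;- Inf
  for (iter in seq_len(max_iter)) {
    B &amp;lt;- solve(t(Z) %*% Z) %*% t(Z) %*% W          # update B given Z
    Z &amp;lt;- W %*% t(B) %*% solve(B %*% t(B))          # update Z given B
    dist_new &amp;lt;- sqrt(sum((W - Z %*% B)^2))         # objective Q
    if (abs(dist_old - dist_new) &amp;lt; tol) break      # Q has stopped changing
    dist_old &amp;lt;- dist_new
  }
  list(Z = Z, B = B, distance = dist_new, iterations = iter)
}

# Usage on the word counts of the illustrative example
W_ex &amp;lt;- matrix(c(4, 6, 0, 2,  2,
                 0, 0, 4, 8, 12,
                 6, 9, 1, 5,  6,
                 2, 3, 3, 7, 10,
                 0, 0, 3, 6,  9,
                 4, 6, 1, 4,  5), byrow = TRUE, nrow = 6)
fit &amp;lt;- als_factorize(W_ex, k = 2)
fit$distance&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the matrix of the illustrative example is an exact product of a &lt;span class=&#34;math inline&#34;&gt;\(6 \times 2\)&lt;/span&gt; and a &lt;span class=&#34;math inline&#34;&gt;\(2 \times 5\)&lt;/span&gt; matrix, the distance returned here should be zero up to rounding error.&lt;/p&gt;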
&lt;/div&gt;
&lt;div id=&#34;uniqueness-of-the-solution&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Uniqueness of the solution&lt;/h3&gt;
&lt;p&gt;The solution is not unique, as you might have noticed (compare the Z and B of the illustrative example with the computed Z and B), even though &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; is essentially the same. To see why, assume &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; is an orthogonal matrix, that is, &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; is such that &lt;span class=&#34;math inline&#34;&gt;\(TT^t = I\)&lt;/span&gt;. Then, &lt;span class=&#34;math inline&#34;&gt;\(W \simeq ZB = ZTT^tB = (ZT)(T^tB) = Z^*B^*\)&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(Z^* = ZT\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(B^* = T^tB\)&lt;/span&gt;. Thus, (&lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;) and (&lt;span class=&#34;math inline&#34;&gt;\(Z^*\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(B^*\)&lt;/span&gt;) are both equally valid solutions. Therefore, the solution is not unique. This non uniqueness of the solution poses some challenges for inferential studies based on the reduced dimension.&lt;/p&gt;
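&lt;p&gt;The argument can be checked numerically. The sketch below takes the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; of the illustrative example and uses a rotation by 30 degrees as &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt;; the angle and the names &lt;code&gt;T_rot&lt;/code&gt;, &lt;code&gt;Z_star&lt;/code&gt; and &lt;code&gt;B_star&lt;/code&gt; are arbitrary choices made for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z_ex &amp;lt;- matrix(c(2, 0,
                 0, 4,
                 3, 1,
                 1, 3,
                 0, 3,
                 2, 1), byrow = TRUE, nrow = 6)
B_ex &amp;lt;- matrix(c(2, 3, 0, 1, 1,
                 0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)

# A 2 x 2 rotation matrix satisfies T T&#39; = I
theta &amp;lt;- pi / 6
T_rot &amp;lt;- matrix(c(cos(theta), -sin(theta),
                  sin(theta),  cos(theta)), byrow = TRUE, nrow = 2)

Z_star &amp;lt;- Z_ex %*% T_rot        # rotated document scores
B_star &amp;lt;- t(T_rot) %*% B_ex     # rotated word weights

# Different factors, same product (up to rounding error)
same_product &amp;lt;- all.equal(Z_star %*% B_star, Z_ex %*% B_ex)
same_product&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The entries of &lt;code&gt;Z_star&lt;/code&gt; and &lt;code&gt;B_star&lt;/code&gt; differ from those of the original factors, and some of them are negative, yet they reproduce the same matrix.&lt;/p&gt;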
&lt;/div&gt;
&lt;div id=&#34;interpreting-the-new-variables&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Interpreting the new variables&lt;/h3&gt;
&lt;p&gt;Again, we use the rows of the &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; matrix to infer the meaning of each column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. Why? Observe that &lt;span class=&#34;math display&#34;&gt;\[\hat B = (Z^tZ)^{-1}Z^tW\]&lt;/span&gt; Let’s define &lt;span class=&#34;math inline&#34;&gt;\(F = (Z^tZ)^{-1}Z^t\)&lt;/span&gt; with elements &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt;, that is, &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt; is the value in the &lt;span class=&#34;math inline&#34;&gt;\(i^{th}\)&lt;/span&gt; row, &lt;span class=&#34;math inline&#34;&gt;\(j^{th}\)&lt;/span&gt; column of the matrix &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt;. Thus, &lt;span class=&#34;math inline&#34;&gt;\(\hat B = FW\)&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[
\hat B_{k,p}=\begin{bmatrix}b_{1,1} &amp;amp; b_{1,2} &amp;amp; \cdots &amp;amp; b_{1,p}\\
b_{2,1} &amp;amp; b_{2,2} &amp;amp; \cdots &amp;amp; b_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
b_{k,1} &amp;amp; b_{k,2} &amp;amp; \cdots &amp;amp; b_{k,p}
\end{bmatrix}
=
\begin{bmatrix}f_{1,1} &amp;amp; f_{1,2} &amp;amp; \cdots &amp;amp; f_{1,n}\\
f_{2,1} &amp;amp; f_{2,2} &amp;amp; \cdots &amp;amp; f_{2,n}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
f_{k,1} &amp;amp; f_{k,2} &amp;amp; \cdots &amp;amp; f_{k,n}
\end{bmatrix}
\begin{bmatrix}w_{1,1} &amp;amp; w_{1,2} &amp;amp; \cdots &amp;amp; w_{1,p}\\
w_{2,1} &amp;amp; w_{2,2} &amp;amp; \cdots &amp;amp; w_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
w_{n,1} &amp;amp; w_{n,2} &amp;amp; \cdots &amp;amp; w_{n,p}
\end{bmatrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;If you still remember matrix operations from high school, note that: &lt;span class=&#34;math display&#34;&gt;\[b_{1,1} = \sum_{l=1}^nf_{1,l}\times w_{l,1} \\
= f_{1,1}w_{1,1}+f_{1,2}w_{2,1}+f_{1,3}w_{3,1}+\cdots+f_{1,n}w_{n,1}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[b_{1,2} = \sum_{l=1}^nf_{1,l}\times w_{l,2} \\
= f_{1,1}w_{1,2}+f_{1,2}w_{2,2}+f_{1,3}w_{3,2}+\cdots+f_{1,n}w_{n,2}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Observe that the source of any numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(b_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(b_{1,2}\)&lt;/span&gt; is the numerical difference between the first and second column of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; (the &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt; are exactly the same). Also, observe that, whatever &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; is, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,1}\)&lt;/span&gt; is a total weight of the first variable &lt;span class=&#34;math inline&#34;&gt;\(W_1\)&lt;/span&gt; (say the counts of word 1 in all the documents). Likewise, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,2}\)&lt;/span&gt; is a total weight of the second variable &lt;span class=&#34;math inline&#34;&gt;\(W_2\)&lt;/span&gt; (say the counts of the second word in all the documents); and so on until &lt;span class=&#34;math inline&#34;&gt;\(b_{1,p}\)&lt;/span&gt;. Put differently, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,j}\)&lt;/span&gt; is a total weight of the word &lt;span class=&#34;math inline&#34;&gt;\(W_j\)&lt;/span&gt;. Thus, the coefficients &lt;span class=&#34;math inline&#34;&gt;\([b_{1,1},b_{1,2}, \cdots,b_{1,p}]\)&lt;/span&gt; are the total weights of the words &lt;span class=&#34;math inline&#34;&gt;\(W_1, W_2, \cdots, W_p\)&lt;/span&gt;, respectively. If these are words’ weights, it is natural to use the words with the highest weights to name row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. We name the remaining rows of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; in similar fashion.&lt;/p&gt;
&lt;p&gt;Also, observe that the elements of the first row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; are the coefficients of the first column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. If row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is named, say education for example, then the first column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; is an education variable. Hence, the naming of the columns of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-values-of-the-new-variables-z&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The values of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;Again, we have &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WB^t(BB^t)^{-1}\]&lt;/span&gt; Let’s define &lt;span class=&#34;math display&#34;&gt;\[N = B^t(BB^t)^{-1}\]&lt;/span&gt; Then &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WN\]&lt;/span&gt; That is &lt;span class=&#34;math display&#34;&gt;\[
\hat{Z} 
= 
\begin{bmatrix}
z_{1,1} &amp;amp; z_{1,2} &amp;amp; \cdots &amp;amp; z_{1,k}\\
z_{2,1} &amp;amp; z_{2,2} &amp;amp; \cdots &amp;amp; z_{2,k}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
z_{n,1} &amp;amp; z_{n,2} &amp;amp; \cdots &amp;amp; z_{n,k}
\end{bmatrix} 
=
\begin{bmatrix}
w_{1,1} &amp;amp; w_{1,2} &amp;amp; \cdots &amp;amp; w_{1,p}\\
w_{2,1} &amp;amp; w_{2,2} &amp;amp; \cdots &amp;amp; w_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
w_{n,1} &amp;amp; w_{n,2} &amp;amp; \cdots &amp;amp; w_{n,p}
\end{bmatrix}
\begin{bmatrix}
n_{1,1} &amp;amp; n_{1,2} &amp;amp; \cdots &amp;amp; n_{1,k}\\
n_{2,1} &amp;amp; n_{2,2} &amp;amp; \cdots &amp;amp; n_{2,k}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
n_{p,1} &amp;amp; n_{p,2} &amp;amp; \cdots &amp;amp; n_{p,k}
\end{bmatrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Observe that &lt;span class=&#34;math display&#34;&gt;\[z_{1,1} = \sum_{m = 1}^p n_{m,1}w_{1,m} \\
 = n_{1,1}w_{1,1}+ n_{2,1}w_{1,2}+n_{3,1}w_{1,3}+\cdots+n_{p,1}w_{1,p}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[z_{1,2} = \sum_{m = 1}^p n_{m,2}w_{1,m} \\
= n_{1,2}w_{1,1}+ n_{2,2}w_{1,2}+n_{3,2}w_{1,3}+\cdots+n_{p,2}w_{1,p}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[z_{2,1} = \sum_{m = 1}^p n_{m,1}w_{2,m} \\
= n_{1,1}w_{2,1}+ n_{2,1}w_{2,2}+n_{3,1}w_{2,3}+\cdots+n_{p,1}w_{2,p}\]&lt;/span&gt; The numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(z_{1,2}\)&lt;/span&gt; stems from the numerical difference between the weights in column &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; of the weights matrix &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; can be seen as a weight matrix). The numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(z_{2,1}\)&lt;/span&gt; stems from the numerical difference between the words counts in documents &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; of the words counts matrix &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Alternatively, we can think of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; as a composite index matrix. &lt;span class=&#34;math inline&#34;&gt;\(z_{i,j}\)&lt;/span&gt; is the value of the index &lt;span class=&#34;math inline&#34;&gt;\(j\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt;. For example, &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; is the value of index &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;; &lt;span class=&#34;math inline&#34;&gt;\(z_{1,2}\)&lt;/span&gt; is the value of index &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;. Why different index values for the same document? Because each index assigns different weights to the same words. For index &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;, the weights are the &lt;span class=&#34;math inline&#34;&gt;\(n_{m,1}\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(m=\{1, 2,\cdots,p\}\)&lt;/span&gt;). For the index &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt;, the weights are &lt;span class=&#34;math inline&#34;&gt;\(n_{m,2}\)&lt;/span&gt;. And for the &lt;span class=&#34;math inline&#34;&gt;\(k^{th}\)&lt;/span&gt; index, the weights are &lt;span class=&#34;math inline&#34;&gt;\(n_{m,k}\)&lt;/span&gt;.&lt;/p&gt;
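&lt;p&gt;To see the weight matrix &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; at work, the sketch below computes it from the &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; of the illustrative example and applies it to the word counts &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; of that example. The object names are arbitrary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B_ex &amp;lt;- matrix(c(2, 3, 0, 1, 1,
                 0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
W_ex &amp;lt;- matrix(c(4, 6, 0, 2,  2,
                 0, 0, 4, 8, 12,
                 6, 9, 1, 5,  6,
                 2, 3, 3, 7, 10,
                 0, 0, 3, 6,  9,
                 4, 6, 1, 4,  5), byrow = TRUE, nrow = 6)

# Weight matrix N = B&#39;(BB&#39;)^{-1}: one row per word, one column per index
N &amp;lt;- t(B_ex) %*% solve(B_ex %*% t(B_ex))

# Each document&#39;s index values are weighted sums of its word counts
Z_hat &amp;lt;- W_ex %*% N
round(Z_hat, 6)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since this &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; is exactly &lt;span class=&#34;math inline&#34;&gt;\(ZB\)&lt;/span&gt;, we get &lt;span class=&#34;math inline&#34;&gt;\(WN = ZBB^t(BB^t)^{-1} = Z\)&lt;/span&gt;, so the result reproduces the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; of the illustrative example. Each column of &lt;code&gt;N&lt;/code&gt; holds the weights that one index assigns to the five words.&lt;/p&gt;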
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;some-variants-of-the-matrix-factorization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Some variants of the matrix factorization&lt;/h1&gt;
&lt;p&gt;1- Note that our working example data &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; is count data. Naturally, we would want &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to have non-negative values. &lt;a href=&#34;https://en.wikipedia.org/wiki/Non-negative_matrix_factorization&#34; target=&#34;_blank&#34;&gt;Non-Negative Matrix Factorization&lt;/a&gt; was invented to force the elements of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to be non-negative.&lt;/p&gt;
&lt;p&gt;2- Moreover, the algorithm presented above assumes no probability distribution. Consequently, it is inappropriate to use &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; for inferential studies (inferential studies build on probabilistic assumptions about the data generating process). Probabilistic matrix factorization algorithms address these concerns. These methods include probabilistic Principal Component Analysis (PPCA), Multinomial Principal Component Analysis (mPCA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), etc.&lt;/p&gt;
&lt;p&gt;3- Traditional matrix factorization methods implicitly or explicitly assume a multivariate normal distribution, and decompose the covariance matrix of the data. Factor Analysis (FA) and Principal Component Analysis (PCA) are two examples.&lt;/p&gt;
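&lt;p&gt;Returning to the first variant, here is a minimal sketch of a non-negative factorization of the word counts of the illustrative example. It does not use the alternating least squares algorithm presented above; it uses the multiplicative update rules commonly attributed to Lee and Seung, which keep every element non-negative as long as the starting values are positive. The seed, the number of iterations, and the small constant &lt;code&gt;eps&lt;/code&gt; (added to avoid division by zero) are assumptions made for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;W_ex &amp;lt;- matrix(c(4, 6, 0, 2,  2,
                 0, 0, 4, 8, 12,
                 6, 9, 1, 5,  6,
                 2, 3, 3, 7, 10,
                 0, 0, 3, 6,  9,
                 4, 6, 1, 4,  5), byrow = TRUE, nrow = 6)
k &amp;lt;- 2
eps &amp;lt;- 1e-9

# Strictly positive random starting values
set.seed(1)
Z_nn &amp;lt;- matrix(runif(nrow(W_ex) * k), nrow = nrow(W_ex))
B_nn &amp;lt;- matrix(runif(k * ncol(W_ex)), nrow = k)

for (iter in 1:500) {
  # Element-wise multiplicative updates: ratios of non-negative terms
  B_nn &amp;lt;- B_nn * (t(Z_nn) %*% W_ex) / (t(Z_nn) %*% Z_nn %*% B_nn + eps)
  Z_nn &amp;lt;- Z_nn * (W_ex %*% t(B_nn)) / (Z_nn %*% B_nn %*% t(B_nn) + eps)
}

# Smallest element of each factor, and distance between W and Z B
c(min(Z_nn), min(B_nn))
sqrt(sum((W_ex - Z_nn %*% B_nn)^2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each update multiplies the current value by a ratio of non-negative quantities, so no element can become negative. How close the fitted product gets to &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; depends on the starting values and on the number of iterations.&lt;/p&gt;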
&lt;p&gt;I hope this introductory exposition of topic modeling provides an intuitive understanding of the why, and how of the subject. Feel free to leave your comments below.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
