<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Topic Modeling on Salfo Bikienga</title>
    <link>/categories/topic-modeling/</link>
    <description>Recent content in Topic Modeling on Salfo Bikienga</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2017 Salfo Bikienga</copyright>
    <lastBuildDate>Fri, 17 Nov 2017 00:00:00 +0000</lastBuildDate>
    <atom:link href="/categories/topic-modeling/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Topic modeling: The Intuition</title>
      <link>/post/topic-modeling-the-intuition/</link>
      <pubDate>Fri, 17 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-the-intuition/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Whenever I give a talk on topic modeling to people not familiar with the subject, the usual question I receive is: “can you provide some intuition behind topic modeling?” Another variant of the same question is: “This is magic. How can the computer identify the topics in the documents?” No! It is not magic. It is math. I presented the math behind &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;Latent Dirichlet Allocation&lt;/a&gt;, and an &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/topic-modeling-an-application/&#34; target=&#34;_blank&#34;&gt;example application&lt;/a&gt;, in previous posts. Here is my attempt at providing the intuition, from the perspective of someone with a basic understanding of simple linear regression and a bit of matrix algebra.&lt;br /&gt;
Topic modeling is a form of matrix factorization. Though modern topic modeling algorithms involve complex probability theory, the basic intuition can be developed through simple matrix factorization.&lt;br /&gt;
Matrix factorization can be understood as a form of data dimension reduction. In a world of “big data”, the usefulness of such a method is immense. For instance, linear regression, the most used statistical tool in economics, is only applicable when &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is at least as big as &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt;, the number of variables. When &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; is too big, we resort to some dimension reduction method such as choosing a few variables based on theory, or using &lt;a href=&#34;https://en.wikipedia.org/wiki/Stepwise_regression&#34; target=&#34;_blank&#34;&gt;stepwise&lt;/a&gt; or &lt;a href=&#34;https://en.wikipedia.org/wiki/Lasso_(statistics)&#34; target=&#34;_blank&#34;&gt;LASSO&lt;/a&gt; regression. With matrix factorization, we do not have to select variables. We can just “redefine” the variables in a lower dimensional space, that is, convert the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; dimensional data into &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; dimensional data, where &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; is significantly smaller than &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(k&amp;lt;&amp;lt;p\)&lt;/span&gt;). The question is, how does that make sense? It is just matrix algebra, as you will see very soon.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-idea-of-dimension-reduction-from-matrix-factorization-perspective&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The idea of dimension reduction from matrix factorization perspective&lt;/h1&gt;
&lt;p&gt;1- Consider measures of &lt;code&gt;length&lt;/code&gt;, &lt;code&gt;width&lt;/code&gt;, and &lt;code&gt;depth&lt;/code&gt;. These are three variables, i.e. three dimensional data. If size is all the information we care about, then &lt;code&gt;volume&lt;/code&gt;, that is, &lt;code&gt;Volume = length x width x depth&lt;/code&gt;, is a good variable. Thus, we can collapse the three variables (&lt;code&gt;length&lt;/code&gt;, &lt;code&gt;width&lt;/code&gt;, &lt;code&gt;depth&lt;/code&gt;) into a single variable, &lt;code&gt;volume&lt;/code&gt;, and preserve the essential information needed.&lt;/p&gt;
&lt;p&gt;2- Consider measures of &lt;code&gt;height&lt;/code&gt;, &lt;code&gt;weight&lt;/code&gt;, and &lt;code&gt;waist&lt;/code&gt;. These are three variables, i.e. three dimensional data. If size provides enough information for what we need, then some form of linear combination of the three variables (&lt;span class=&#34;math inline&#34;&gt;\(size = b_1\times height + b_2 \times weight + b_3 \times waist\)&lt;/span&gt;) will do. Thus, we collapse the three dimensional data into one dimensional data.&lt;/p&gt;
&lt;p&gt;3- Consider a dataset of word counts in several documents. Let’s consider the following words: &lt;code&gt;college&lt;/code&gt;, &lt;code&gt;drugs&lt;/code&gt;, &lt;code&gt;education&lt;/code&gt;, &lt;code&gt;graduation&lt;/code&gt;, &lt;code&gt;health&lt;/code&gt;, &lt;code&gt;medicaid&lt;/code&gt;. This is six dimensional data. If what we care about are the concepts of education and health care, then some form of linear combination of these word counts will do. Our task consists of finding the appropriate weights so that a document having higher counts of education related words than other words gets a high value for the education concept, and a low value for the health concept; and a document having higher counts of health related words than other words gets a high value for the health concept, and a low value for the education concept. Thus, we reduce the six dimensional data into two dimensional data, while preserving the essential information we care about. A small sketch of this example follows.&lt;/p&gt;
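&lt;p&gt;To make example 3 concrete, here is a minimal R sketch with made-up counts and hand-picked (hypothetical) weights; a topic modeling algorithm would instead estimate such weights from the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Word counts for one hypothetical document
counts &amp;lt;- c(college = 5, drugs = 0, education = 7,
            graduation = 3, health = 1, medicaid = 0)
w_edu    &amp;lt;- c(1, 0, 1, 1, 0, 0) # hypothetical weights for the education concept
w_health &amp;lt;- c(0, 1, 0, 0, 1, 1) # hypothetical weights for the health concept
c(education = sum(w_edu * counts), health = sum(w_health * counts))
# education = 15, health = 1: the document leans heavily toward education&lt;/code&gt;&lt;/pre&gt;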
&lt;/div&gt;
&lt;div id=&#34;the-idea-of-matrix-factorization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The idea of matrix factorization&lt;/h1&gt;
The idea of matrix factorization stems from the fact that any matrix can be decomposed into the product of two or more matrices. Let &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p}\)&lt;/span&gt; be a data matrix with &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; rows and &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; columns. We can write the same matrix as the product of two matrices, such as:
&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
W_{n,p} \simeq Z_{n,k}B_{k,p}
\label{eq:fac1}
\end{equation}\]&lt;/span&gt;
&lt;p&gt;It turns out that &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; preserves the essential information needed to understand variations between the &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; observations in &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;.&lt;/p&gt;
&lt;div id=&#34;illustrative-example&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Illustrative example&lt;/h2&gt;
&lt;p&gt;Let &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p}\)&lt;/span&gt; be a spreadsheet of word counts in &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; documents. &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; is the number of unique words, and the words can be seen as variables. Here, &lt;span class=&#34;math inline&#34;&gt;\(n=6\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(p = 5\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(k = 2\)&lt;/span&gt;. Let &lt;code&gt;college&lt;/code&gt;, &lt;code&gt;education&lt;/code&gt;, &lt;code&gt;family&lt;/code&gt;, &lt;code&gt;health&lt;/code&gt;, and &lt;code&gt;medicaid&lt;/code&gt; be, respectively, the variable names of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[ 
\underbrace{\begin{bmatrix}
4&amp;amp;6&amp;amp;0&amp;amp;2&amp;amp;2  \\ 
0&amp;amp;0&amp;amp;4&amp;amp;8&amp;amp;12  \\ 
6&amp;amp;9&amp;amp;1&amp;amp;5&amp;amp;6 \\    
2&amp;amp;3&amp;amp;3&amp;amp;7&amp;amp;10 \\
0&amp;amp;0&amp;amp;3&amp;amp;6&amp;amp;9 \\
4&amp;amp;6&amp;amp;1&amp;amp;4&amp;amp;5 \\
    \end{bmatrix}
        }_{\mathbf{W_{6,5}}}
=
\underbrace{\begin{bmatrix} 
2&amp;amp;0  \\
0&amp;amp;4  \\
3&amp;amp;1 \\
1&amp;amp;3 \\
0&amp;amp;3 \\
2&amp;amp;1 \\
    \end{bmatrix}
        }_{\mathbf{Z_{6,2}}} 
\underbrace{\begin{bmatrix} 
2&amp;amp;3&amp;amp;0&amp;amp;1&amp;amp;1 \\
0&amp;amp;0&amp;amp;1&amp;amp;2&amp;amp;3 \\             
    \end{bmatrix}
        }_{\mathbf{B_{2,5}}} 
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It is easy to check that the product holds, that is, &lt;span class=&#34;math inline&#34;&gt;\(Z_{6,2}B_{2,5} = W_{6,5}\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; contains most of the information about the observations contained in &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;. With &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, we can easily explore, or study, the variation between the observations. &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; is a two dimensional representation of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;, and a simple scatterplot can be used to explore the data, as shown in the plot below.&lt;/p&gt;
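&lt;p&gt;Before plotting, we can verify the product in R (a direct transcription of the matrices above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- matrix(c(2, 0,
             0, 4,
             3, 1,
             1, 3,
             0, 3,
             2, 1), byrow = TRUE, nrow = 6)
B &amp;lt;- matrix(c(2, 3, 0, 1, 1,
             0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
Z %*% B # reproduces the W matrix above, entry by entry&lt;/code&gt;&lt;/pre&gt;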
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- matrix(c(2, 0,
             0, 4,
             3, 1,
             1, 3,
             0, 3,
             2, 1), byrow = TRUE, nrow = 6)
Z &amp;lt;- data.frame(z1 = Z[,1], z2 = Z[, 2])
plot(x = Z$z1, y = Z$z2, cex = 3)
text(x = Z$z1, y = Z$z2, labels= 1:6, cex= 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-12-17-topic-modeling-the-intuition_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the plot, we can deduce that observations (or documents) 2, 4 and 5 are close to each other; observations 1, 6 and 3 are also close to each other. The point here is that with a reduced dimension, it is easier to draw some insight from the data. Hence, the benefit of matrix factorization for data analysis.&lt;/p&gt;
&lt;p&gt;For a &lt;strong&gt;predictive modeling&lt;/strong&gt; exercise, we replace the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix with the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; matrix, and the usual tools (linear regression, logistic regression, regression tree, etc.) can be used. We do not have to understand the meaning of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. We only care about their ability to predict.&lt;br /&gt;
However, for &lt;strong&gt;exploratory&lt;/strong&gt; and &lt;strong&gt;inferential&lt;/strong&gt; data analysis, we want to understand the meaning of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. To tell a story, we have to know what the variables mean. We infer the meaning of the new variables by inspecting the &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; matrix. I will explain why that is the case shortly. For now, note that the number of columns of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is the number of columns of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix, and the number of its rows is the number of columns of the &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; matrix. Each row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is used to interpret the meaning of the corresponding column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;: row &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is used to infer the meaning of column &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. For instance, referring to our illustrative example above, row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has its biggest values in its first and second columns; that is, variables 1 and 2 (&lt;code&gt;college&lt;/code&gt; and &lt;code&gt;education&lt;/code&gt;) of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix are dominant in identifying the meaning of the first &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; variable. Likewise, row 2 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has its biggest values in its last two columns; variables 4 and 5 (&lt;code&gt;health&lt;/code&gt; and &lt;code&gt;medicaid&lt;/code&gt;) of the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix are dominant in identifying the meaning of the second &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; variable. Thus, the columns of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; represent, respectively, measures of the education and health concepts, as the sketch below illustrates.&lt;/p&gt;
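&lt;p&gt;Here is a small sketch of that reading of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, using the illustrative matrices above: the largest entries in each row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; point to the words that dominate the corresponding column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B &amp;lt;- matrix(c(2, 3, 0, 1, 1,
             0, 0, 1, 2, 3), byrow = TRUE, nrow = 2,
           dimnames = list(c(&amp;quot;Z1&amp;quot;, &amp;quot;Z2&amp;quot;),
                           c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)))
# Top two words per row of B
apply(B, 1, function(w) names(sort(w, decreasing = TRUE))[1:2])
# Z1: education, college -- Z2: medicaid, health&lt;/code&gt;&lt;/pre&gt;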
&lt;/div&gt;
&lt;div id=&#34;finding-z-and-b&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finding &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;There are several matrix factorization algorithms: Factor Analysis (FA), Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF), Probabilistic Latent Semantic Analysis (PLSA) and its variants, etc. Since our goal for this introduction is to present the basic idea, let’s present an algorithm that is closer to something we are all familiar with: Ordinary Least Squares (OLS).&lt;/p&gt;
&lt;div id=&#34;multivariate-ols&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Multivariate OLS&lt;/h3&gt;
&lt;p&gt;From introductory statistics, we know that for: &lt;span class=&#34;math display&#34;&gt;\[y_{n,1} = X_{n,p}\beta_{p,1} + \epsilon_{n,1}\]&lt;/span&gt; the least squares solution for &lt;span class=&#34;math inline&#34;&gt;\(\beta_{p,1}\)&lt;/span&gt; is: &lt;span class=&#34;math display&#34;&gt;\[\hat\beta_{p,1} = (X^tX)^{-1}X^ty\]&lt;/span&gt; We are assuming that &lt;span class=&#34;math inline&#34;&gt;\((X^tX)^{-1}\)&lt;/span&gt; exists. &lt;span class=&#34;math inline&#34;&gt;\(t\)&lt;/span&gt; stands for transpose, and &lt;span class=&#34;math inline&#34;&gt;\(-1\)&lt;/span&gt; stands for inverse.&lt;/p&gt;
&lt;p&gt;In case you do not remember this formula, recall that: &lt;span class=&#34;math display&#34;&gt;\[y_{n,1} = X_{n,p}\beta_{p,1} + \epsilon_{n,1}
\Leftrightarrow 
X^tY = (X^tX)\beta + X^t\epsilon\]&lt;/span&gt; Under the assumptions of no correlation between &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\epsilon\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(E(\epsilon) = 0\)&lt;/span&gt;, we can set &lt;span class=&#34;math inline&#34;&gt;\(X^t\epsilon=0\)&lt;/span&gt;. So we have: &lt;span class=&#34;math display&#34;&gt;\[X^tY = (X^tX)\beta \\
\Leftrightarrow \\
(X^tX)^{-1}X^tY = (X^tX)^{-1}(X^tX)\beta \\
\Rightarrow \\
\hat\beta = (X^tX)^{-1}X^tY
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;For more than a single left hand side variable &lt;span class=&#34;math inline&#34;&gt;\(y_{n,1}\)&lt;/span&gt;, the same formula applies, and we have: &lt;span class=&#34;math display&#34;&gt;\[\hat B = (X^tX)^{-1}X^tY\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is a &lt;span class=&#34;math inline&#34;&gt;\(p\times q\)&lt;/span&gt; matrix, and &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt; is an &lt;span class=&#34;math inline&#34;&gt;\(n \times q\)&lt;/span&gt; matrix.&lt;/p&gt;
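&lt;p&gt;As a quick sanity check on this formula, here is a short R sketch with simulated (made-up) data: the normal-equations solution matches the coefficients that &lt;code&gt;lm()&lt;/code&gt; returns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
n &amp;lt;- 100; p &amp;lt;- 3; q &amp;lt;- 2
X &amp;lt;- matrix(rnorm(n * p), nrow = n)
Y &amp;lt;- X %*% matrix(c(1, 2, 3, -1, 0, 4), nrow = p) + matrix(rnorm(n * q), nrow = n)
B_hat &amp;lt;- solve(t(X) %*% X) %*% t(X) %*% Y # (X^t X)^{-1} X^t Y
all.equal(unname(B_hat), unname(coef(lm(Y ~ X - 1)))) # TRUE&lt;/code&gt;&lt;/pre&gt;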
&lt;/div&gt;
&lt;div id=&#34;multivariate-ols-and-matrix-factorization&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Multivariate OLS and matrix factorization&lt;/h3&gt;
What does multivariate regression have to do with matrix factorization? Note that, ignoring the &lt;span class=&#34;math inline&#34;&gt;\(\epsilon\)&lt;/span&gt;, we could have written:
&lt;span class=&#34;math display&#34;&gt;\[\begin{equation}
Y_{n,q} \simeq X_{n,p}B_{p,q}
\label{eq:ols}
\end{equation}\]&lt;/span&gt;
&lt;p&gt;This equation is very similar to the equation &lt;span class=&#34;math inline&#34;&gt;\(W_{n,p} \simeq Z_{n,k}B_{k,p}\)&lt;/span&gt;, except that &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; is observed in the case of multivariate OLS.&lt;/p&gt;
&lt;p&gt;In multivariate OLS, we only estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. For matrix factorization, we estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;.&lt;br /&gt;
From &lt;span class=&#34;math inline&#34;&gt;\(W \simeq ZB\)&lt;/span&gt;, we can solve for &lt;span class=&#34;math display&#34;&gt;\[\hat B = (Z^tZ)^{-1}Z^tW\]&lt;/span&gt; or &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WB^t(BB^t)^{-1}\]&lt;/span&gt; The predicted value of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; is: &lt;span class=&#34;math display&#34;&gt;\[\hat W = \hat Z \hat B\]&lt;/span&gt; To estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, we need &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, and to estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; we need &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. We do not have either one. The trick is to guess some initial values for &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and use them to estimate &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;, then use the estimated &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to estimate a new &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. Use the new &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; to estimate a new &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. Continue the iteration until some stopping criterion is met. Thus, we estimate &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; iteratively (this estimation method is known as Alternating Least Squares). When do we stop the iteration?&lt;/p&gt;
&lt;p&gt;Again, &lt;span class=&#34;math inline&#34;&gt;\(\hat W = \hat Z \hat B\)&lt;/span&gt; is the predicted value of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;. We iterate until the distance between &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; and its predicted value, &lt;span class=&#34;math inline&#34;&gt;\(\hat W\)&lt;/span&gt;, is negligible. There are several distance measures, but let’s keep things simple by using the Euclidean distance, or &lt;span class=&#34;math inline&#34;&gt;\(L_2\)&lt;/span&gt; norm: &lt;span class=&#34;math display&#34;&gt;\[Q(\hat Z, \hat B) = ||W-\hat W (\hat Z, \hat B)||_2 = \sqrt{\sum_{i = 1}^n \sum_{j = 1}^p (w_{i,j} - \hat w_{i,j})^2}\]&lt;/span&gt; Thus, we minimize &lt;span class=&#34;math inline&#34;&gt;\(Q\)&lt;/span&gt;, the objective function. Following is an example implementation of a simple alternating least squares algorithm.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;W &amp;lt;- matrix(c(4,    6,    0,    2,    2,
             0,    0,    4,    8,   12,
             6,    9,    1,    5,    6,
             2,    3,    3,    7,   10,
             0,    0,    3,    6,    9,
             2,    6,    1,    4,    5), byrow = TRUE, nrow = 6)

set.seed(3)
Z_init &amp;lt;- abs(round(rnorm(n = 6*2, mean = 0, sd = 2),0))
Z_init &amp;lt;- matrix(Z_init, nrow = 6)

Z &amp;lt;- Z_init
dist_ww &amp;lt;- 1e3
max_iter &amp;lt;- 1000
iter &amp;lt;- 0
while(iter &amp;lt;= max_iter &amp;amp;&amp;amp; dist_ww &amp;gt;= 1e-6) {
  iter &amp;lt;- iter + 1
  ZZ_inv &amp;lt;- solve(t(Z)%*%Z)
  B &amp;lt;- ZZ_inv%*%t(Z)%*%W   # update B given Z: (Z^t Z)^{-1} Z^t W
  BB_inv &amp;lt;- solve(B%*%t(B))
  Z &amp;lt;- W%*%t(B)%*%BB_inv   # update Z given B: W B^t (B B^t)^{-1}
  W_hat &amp;lt;- Z%*%B
  dist_ww &amp;lt;- sqrt(sum((W - W_hat)^2)) # L2 distance between W and W_hat
}
W &amp;lt;- data.frame(W)
names(W) &amp;lt;- c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)
Z &amp;lt;- data.frame(round(Z, 2))
row.names(Z) &amp;lt;- paste0(&amp;quot;document.&amp;quot;, 1:6)
names(Z) &amp;lt;- c(&amp;quot;Topic.1&amp;quot;, &amp;quot;Topic.2&amp;quot;)
B &amp;lt;- data.frame(round(B, 2), row.names = c(&amp;quot;Topic.1&amp;quot;, &amp;quot;Topic.2&amp;quot;))
names(B) &amp;lt;- c(&amp;quot;college&amp;quot;, &amp;quot;education&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;health&amp;quot;, &amp;quot;medicaid&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Below is the table of the least squares estimate of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;B&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         college education family health medicaid
## Topic.1    1.18      1.96  -0.02    0.6     0.58
## Topic.2    0.50      0.85   1.11    2.5     3.60&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Observe that row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; has high values in columns 1 and 2 compared to columns 3, 4, and 5; and row 2 has higher values for columns 4 and 5 compared to columns 1, 2, and 3. It is reasonable to infer that row 1 (&lt;code&gt;Topic.1&lt;/code&gt;) refers to education, and row 2 (&lt;code&gt;Topic.2&lt;/code&gt;) refers to health.&lt;/p&gt;
&lt;p&gt;Below is the table of the least squares estimate of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic.1 Topic.2
## document.1    3.13    0.05
## document.2   -1.55    3.58
## document.3    4.31    0.97
## document.4    0.41    2.71
## document.5   -1.16    2.68
## document.6    2.26    1.03&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Observe that &lt;code&gt;Topic.1&lt;/code&gt; has big values in documents 1, 3, and 6. Likewise, &lt;code&gt;Topic.2&lt;/code&gt; has big values in documents 2, 4, and 5. Hence, we can infer that documents 1, 3, and 6 are mostly about education; and documents 2, 4, and 5 are mostly about health.&lt;/p&gt;
&lt;p&gt;We can use a scatterplot to explore the original five dimensional &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; data through its two dimensional representation &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(x = Z$Topic.1, y = Z$Topic.2, cex = 3, 
     xlab = &amp;quot;Topic.1&amp;quot;, ylab = &amp;quot;Topic.2&amp;quot;)
text(x = Z$Topic.1, y = Z$Topic.2, labels= 1:6, cex= 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-12-17-topic-modeling-the-intuition_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;uniqueness-of-the-solution&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Uniqueness of the solution&lt;/h3&gt;
&lt;p&gt;The solution is not unique, as you might have noticed: the computed &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; differ from those of the illustrative example, even though &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; remains the same. To see why, assume &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; is an orthonormal matrix, that is, &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; is such that &lt;span class=&#34;math inline&#34;&gt;\(TT^t = I\)&lt;/span&gt;. Then, &lt;span class=&#34;math inline&#34;&gt;\(W \simeq ZB = ZTT^tB = (ZT)(T^tB) = Z^*B^*\)&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(Z^* = ZT\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(B^* = T^tB\)&lt;/span&gt;. Thus, (&lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;) and (&lt;span class=&#34;math inline&#34;&gt;\(Z^*\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(B^*\)&lt;/span&gt;) are equally valid solutions. Therefore, the solution is not unique. This non-uniqueness poses some challenges for inferential studies based on the reduced dimension.&lt;/p&gt;
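&lt;p&gt;We can check this numerically: rotating &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; by an orthonormal matrix &lt;span class=&#34;math inline&#34;&gt;\(T\)&lt;/span&gt; (here, a 2 by 2 rotation) changes both factors but leaves their product unchanged.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- matrix(c(2, 0, 0, 4, 3, 1, 1, 3, 0, 3, 2, 1), byrow = TRUE, nrow = 6)
B &amp;lt;- matrix(c(2, 3, 0, 1, 1, 0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
angle &amp;lt;- pi / 6 # any angle yields a valid orthonormal T
T_mat &amp;lt;- matrix(c(cos(angle), -sin(angle),
                  sin(angle),  cos(angle)), nrow = 2)
Z_star &amp;lt;- Z %*% T_mat    # a different, equally valid Z
B_star &amp;lt;- t(T_mat) %*% B # and its matching B
all.equal(Z %*% B, Z_star %*% B_star) # TRUE: same W, different factors&lt;/code&gt;&lt;/pre&gt;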
&lt;/div&gt;
&lt;div id=&#34;interpreting-the-new-variables&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Interpreting the new variables&lt;/h3&gt;
&lt;p&gt;Again, we use the rows of the &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; matrix to infer the meaning of each column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. Why? Observe that &lt;span class=&#34;math display&#34;&gt;\[\hat B = (Z^tZ)^{-1}Z^tW\]&lt;/span&gt; Let’s define &lt;span class=&#34;math inline&#34;&gt;\(F = (Z^tZ)^{-1}Z^t\)&lt;/span&gt; with elements &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt;, that is, &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt; is the value in the &lt;span class=&#34;math inline&#34;&gt;\(i^{th}\)&lt;/span&gt; row, &lt;span class=&#34;math inline&#34;&gt;\(j^{th}\)&lt;/span&gt; column of the matrix &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt;. Thus, &lt;span class=&#34;math inline&#34;&gt;\(\hat B = FW\)&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[
\hat B_{k,p}=\begin{bmatrix}b_{1,1} &amp;amp; b_{1,2} &amp;amp; \cdots &amp;amp; b_{1,p}\\
b_{2,1} &amp;amp; b_{2,2} &amp;amp; \cdots &amp;amp; b_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
b_{k,1} &amp;amp; b_{k,2} &amp;amp; \cdots &amp;amp; b_{k,p}
\end{bmatrix}
=
\begin{bmatrix}f_{1,1} &amp;amp; f_{1,2} &amp;amp; \cdots &amp;amp; f_{1,n}\\
f_{2,1} &amp;amp; f_{2,2} &amp;amp; \cdots &amp;amp; f_{2,n}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
f_{k,1} &amp;amp; f_{k,2} &amp;amp; \cdots &amp;amp; f_{k,n}
\end{bmatrix}
\begin{bmatrix}w_{1,1} &amp;amp; w_{1,2} &amp;amp; \cdots &amp;amp; w_{1,p}\\
w_{2,1} &amp;amp; w_{2,2} &amp;amp; \cdots &amp;amp; w_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
w_{n,1} &amp;amp; w_{n,2} &amp;amp; \cdots &amp;amp; w_{n,p}
\end{bmatrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;If you still remember matrix operations from high school, note that: &lt;span class=&#34;math display&#34;&gt;\[b_{1,1} = \sum_{l=1}^nf_{1,l}\times w_{l,1} \\
= f_{1,1}w_{1,1}+f_{1,2}w_{2,1}+f_{1,3}w_{3,1}+\cdots+f_{1,n}w_{n,1}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[b_{1,2} = \sum_{l=1}^nf_{1,l}\times w_{l,2} \\
= f_{1,1}w_{1,2}+f_{1,2}w_{2,2}+f_{1,3}w_{3,2}+\cdots+f_{1,n}w_{n,2}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Observe that the source of any numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(b_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(b_{1,2}\)&lt;/span&gt; is the numerical difference between the first and second columns of &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; (the &lt;span class=&#34;math inline&#34;&gt;\(f_{i,j}\)&lt;/span&gt; are exactly the same). Also, observe that, whatever &lt;span class=&#34;math inline&#34;&gt;\(F\)&lt;/span&gt; is, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,1}\)&lt;/span&gt; is a weighted total of the first variable &lt;span class=&#34;math inline&#34;&gt;\(W_1\)&lt;/span&gt; (say, the counts of word 1 in all the documents). Likewise, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,2}\)&lt;/span&gt; is a weighted total of the second variable &lt;span class=&#34;math inline&#34;&gt;\(W_2\)&lt;/span&gt; (say, the counts of the second word in all the documents); and so on until &lt;span class=&#34;math inline&#34;&gt;\(b_{1,p}\)&lt;/span&gt;. Put differently, &lt;span class=&#34;math inline&#34;&gt;\(b_{1,j}\)&lt;/span&gt; is a weighted total of the word &lt;span class=&#34;math inline&#34;&gt;\(W_j\)&lt;/span&gt;. Thus, the coefficients &lt;span class=&#34;math inline&#34;&gt;\([b_{1,1},b_{1,2}, \cdots,b_{1,p}]\)&lt;/span&gt; are the total weights of the words &lt;span class=&#34;math inline&#34;&gt;\(W_1, W_2, \cdots, W_p\)&lt;/span&gt;, respectively. If these are word weights, it is natural to use the words with the highest weights to name row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt;. We name the remaining rows of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; in similar fashion.&lt;/p&gt;
&lt;p&gt;Also, observe that the elements of the first row of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; are the coefficients of the first column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;. If row 1 of &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; is named, say, education, then the first column of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; is an education variable. Hence the naming of the columns of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-values-of-the-new-variables-z&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The values of the new variables &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;Again, we have &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WB^t(BB^t)^{-1}\]&lt;/span&gt; Let’s define &lt;span class=&#34;math display&#34;&gt;\[N = B^t(BB^t)^{-1}\]&lt;/span&gt; Then &lt;span class=&#34;math display&#34;&gt;\[\hat Z = WN\]&lt;/span&gt; That is &lt;span class=&#34;math display&#34;&gt;\[
\hat{Z} 
= 
\begin{bmatrix}
z_{1,1} &amp;amp; z_{1,2} &amp;amp; \cdots &amp;amp; z_{1,k}\\
z_{2,1} &amp;amp; z_{2,2} &amp;amp; \cdots &amp;amp; z_{2,k}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
z_{n,1} &amp;amp; z_{n,2} &amp;amp; \cdots &amp;amp; z_{n,k}
\end{bmatrix} 
=
\begin{bmatrix}
w_{1,1} &amp;amp; w_{1,2} &amp;amp; \cdots &amp;amp; w_{1,p}\\
w_{2,1} &amp;amp; w_{2,2} &amp;amp; \cdots &amp;amp; w_{2,p}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
w_{n,1} &amp;amp; w_{n,2} &amp;amp; \cdots &amp;amp; w_{n,p}
\end{bmatrix}
\begin{bmatrix}
n_{1,1} &amp;amp; n_{1,2} &amp;amp; \cdots &amp;amp; n_{1,k}\\
n_{2,1} &amp;amp; n_{2,2} &amp;amp; \cdots &amp;amp; n_{2,k}\\
\vdots &amp;amp; \vdots &amp;amp; \ddots &amp;amp; \vdots\\
n_{p,1} &amp;amp; n_{p,2} &amp;amp; \cdots &amp;amp; n_{p,k}
\end{bmatrix}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Observe that &lt;span class=&#34;math display&#34;&gt;\[z_{1,1} = \sum_{m = 1}^p n_{m,1}w_{1,m} \\
 = n_{1,1}w_{1,1}+ n_{2,1}w_{1,2}+n_{3,1}w_{1,3}+\cdots+n_{p,1}w_{1,p}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[z_{1,2} = \sum_{m = 1}^p n_{m,2}w_{1,m} \\
= n_{1,2}w_{1,1}+ n_{2,2}w_{1,2}+n_{3,2}w_{1,3}+\cdots+n_{p,2}w_{1,p}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[z_{2,1} = \sum_{m = 1}^p n_{m,1}w_{2,m} \\
= n_{1,1}w_{2,1}+ n_{2,1}w_{2,2}+n_{3,1}w_{2,3}+\cdots+n_{p,1}w_{2,p}\]&lt;/span&gt; The numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(z_{1,2}\)&lt;/span&gt; stems from the numerical difference between the weights in columns &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; of the weights matrix &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; can be seen as a weight matrix). The numerical difference between &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(z_{2,1}\)&lt;/span&gt; stems from the numerical difference between the word counts in documents &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; of the word counts matrix &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Alternatively, we can think of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; as a composite index matrix. &lt;span class=&#34;math inline&#34;&gt;\(z_{i,j}\)&lt;/span&gt; is the value of the index &lt;span class=&#34;math inline&#34;&gt;\(j\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt;. For example, &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; is the value of index &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;; &lt;span class=&#34;math inline&#34;&gt;\(z_{1,2}\)&lt;/span&gt; is the value of index &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;. Why different index values for the same document? Because each index assigns different weights to the same words. For index &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt;, the weights are the &lt;span class=&#34;math inline&#34;&gt;\(n_{m,1}\)&lt;/span&gt; (&lt;span class=&#34;math inline&#34;&gt;\(m=\{1, 2,\cdots,p\}\)&lt;/span&gt;). For the index &lt;span class=&#34;math inline&#34;&gt;\(2\)&lt;/span&gt;, the weights are &lt;span class=&#34;math inline&#34;&gt;\(n_{m,2}\)&lt;/span&gt;. And for the &lt;span class=&#34;math inline&#34;&gt;\(k^{th}\)&lt;/span&gt; index, the weights are &lt;span class=&#34;math inline&#34;&gt;\(n_{m,k}\)&lt;/span&gt;.&lt;/p&gt;
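&lt;p&gt;Tying this together in R with the illustrative matrices: we compute &lt;span class=&#34;math inline&#34;&gt;\(N = B^t(BB^t)^{-1}\)&lt;/span&gt;, recover &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; as &lt;span class=&#34;math inline&#34;&gt;\(WN\)&lt;/span&gt; (exactly, since here &lt;span class=&#34;math inline&#34;&gt;\(W = ZB\)&lt;/span&gt; holds exactly), and read off &lt;span class=&#34;math inline&#34;&gt;\(z_{1,1}\)&lt;/span&gt; as a weighted sum of the word counts of document 1.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;W &amp;lt;- matrix(c(4, 6, 0, 2, 2,
             0, 0, 4, 8, 12,
             6, 9, 1, 5, 6,
             2, 3, 3, 7, 10,
             0, 0, 3, 6, 9,
             4, 6, 1, 4, 5), byrow = TRUE, nrow = 6)
B &amp;lt;- matrix(c(2, 3, 0, 1, 1,
             0, 0, 1, 2, 3), byrow = TRUE, nrow = 2)
N &amp;lt;- t(B) %*% solve(B %*% t(B)) # the weight matrix N
Z_hat &amp;lt;- W %*% N                # recovers Z exactly, since W = ZB exactly
sum(W[1, ] * N[, 1])            # z_{1,1} = 2, the value of index 1 in document 1&lt;/code&gt;&lt;/pre&gt;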
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;some-variants-of-the-matrix-factorization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Some variants of the matrix factorization&lt;/h1&gt;
&lt;p&gt;1- Note that our working example data &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; is count data. Naturally, we would want &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to have non-negative values. &lt;a href=&#34;https://en.wikipedia.org/wiki/Non-negative_matrix_factorization&#34; target=&#34;_blank&#34;&gt;Non-Negative Matrix Factorization&lt;/a&gt; was invented to force the elements of &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to be non-negative.&lt;/p&gt;
&lt;p&gt;2- Moreover, the algorithm presented above assumes no probability distribution. Consequently, it is inappropriate to use &lt;span class=&#34;math inline&#34;&gt;\(Z\)&lt;/span&gt; for inferential studies (inferential studies build on probabilistic assumptions about the data generating process). Probabilistic matrix factorization algorithms address these concerns. These methods include Probabilistic Principal Component Analysis (PPCA), Multinomial Principal Component Analysis (mPCA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), etc.&lt;/p&gt;
&lt;p&gt;3- Traditional matrix factorization methods implicitly or explicitly assume a multivariate normal distribution, and decompose the covariance matrix of the data. Factor Analysis (FA) and Principal Component Analysis (PCA) are two examples.&lt;/p&gt;
&lt;p&gt;I hope this introductory exposition of topic modeling provides an intuitive understanding of the why and the how of the subject. Feel free to leave your comments below.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Topic Modeling: An Application</title>
      <link>/post/topic-modeling-an-application/</link>
      <pubDate>Sat, 11 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-an-application/</guid>
      <description>&lt;section id=&#34;introduction&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;My work involves the use and development of topic modeling algorithms. A surprising challenge I have had is communicating the output of topic modeling algorithms to people not familiar with text analytics. Here is my ten-cent explanation of the LDA output for my econ friends.&lt;/p&gt;
&lt;p&gt;The use of text data for &lt;a href=&#34;http://review.chicagobooth.edu/magazine/spring-2015/why-words-are-the-new-numbers&#34; target=&#34;_blank&#34;&gt;economic analysis&lt;/a&gt; is gaining traction. One popular analytical tool is Latent Dirichlet Allocation (LDA), also called topic modeling &lt;span class=&#34;citation&#34; data-cites=&#34;Blei2003&#34;&gt;(Blei, Ng, and Jordan 2003)&lt;/span&gt;. Succinctly put, topic modeling consists of collapsing a matrix (i.e., a spreadsheet) of word counts into a reduced matrix of topic proportions within documents. For instance, assume we have a collection of 500 documents, each containing 2000 unique words; this collection of documents (called a corpus) can be represented as a dataset of 500 observations and 2000 variables (each word being a variable). Each cell in the matrix represents the count of a word in a document. The matrix is just a regular spreadsheet of data. Clearly, it is almost impossible to draw any insight from that many variables. LDA allows us to collapse the high dimensional dataset into a lower dimension, say a dimension of 10. With 10 variables, there is hope that some insight can be drawn from the data. Following is a demonstration of LDA.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;example-data&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Example Data&lt;/h1&gt;
&lt;p&gt;Let’s consider a dataset of U.S. governors’ State of the State Addresses (SoSA). In most states, the governor gives a speech, generally in January, in which he/she lays out his/her priorities for the next fiscal year. Part of the goal of the speech is to explain (or justify) the proposed budget, and hopefully convince the state stakeholders to support the proposed budget. A budget proposal usually involves a reallocation of the state resources, which implies cuts and increases in different lines of the state budget. I collected 596 speeches from governors of the 50 states, spanning from 2001 to 2013.&lt;/p&gt;
&lt;p&gt;It is customary in text analytics to delete words that we believe are not “discriminative”. For instance, linking words such as “the”, “and”, “she”, etc. will not distinguish a Democrat from a Republican. We call this process pre-processing the data, that is, cleaning the data by removing elements of the texts that we believe are not useful for our analysis.&lt;/p&gt;
&lt;p&gt;After pre-processing the data, I am left with a dataset of 596 observations and 1034 words (or variables). You can take a look at the pre-processed data &lt;a href=&#34;http://rpubs.com/sbikienga/334137&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;, or you can download it &lt;a href=&#34;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;. Stemming, that is, stripping words down to their roots, is often done to avoid counting related words separately. For example, education, educational, and educate are all stemmed to educ.&lt;/p&gt;
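&lt;p&gt;As an illustration, here is a small stemming sketch, assuming the &lt;code&gt;SnowballC&lt;/code&gt; package (one of several stemmers available in R):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;SnowballC&amp;quot;) # run once if not installed
library(SnowballC)
wordStem(c(&amp;quot;education&amp;quot;, &amp;quot;educational&amp;quot;, &amp;quot;educate&amp;quot;), language = &amp;quot;porter&amp;quot;)
# all three are stripped down to the root &amp;quot;educ&amp;quot;&lt;/code&gt;&lt;/pre&gt;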
&lt;/section&gt;
&lt;section id=&#34;example-application-of-lda&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Example application of LDA&lt;/h1&gt;
&lt;p&gt;The goal when using LDA is primarily to reduce the dimension of a counts dataset. The hope is that the reduced dimension preserves the essential information contained in the original dataset. Interestingly, the reduced dimension is often more appropriate for statistical analysis, as it “solves” the overfitting problem associated with high dimensional data. Generally, the overfitting problem arises in situations where &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is not big enough to provide reliable estimates of the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; variables’ parameters.&lt;/p&gt;
&lt;p&gt;There are several packages in R to implement the LDA model (&lt;code&gt;lda&lt;/code&gt;, &lt;code&gt;mallet&lt;/code&gt;, and &lt;code&gt;topicmodels&lt;/code&gt;). Here I will use the &lt;code&gt;topicmodels&lt;/code&gt; package as an example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;topicmodels&amp;quot;) # You should run this code once if you don&amp;#39;t have topicmodels installed
library(topicmodels) # Load the topicmodels package
url &amp;lt;- url(&amp;quot;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&amp;quot;)
load(url) # Load the data from the url provided
SoSA_topics &amp;lt;- LDA(SoSA_data_df, # The matrix of words counts
                   k = 2, # The number of topics to construct
                   method = &amp;quot;Gibbs&amp;quot;, # Estimation method
                   control = list(iter = 3000, # Number of iterations
                                   burnin = 1000, # Throw out the first 1000 estimates
                                   seed = 123)) # To get reproducible results&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that LDA is a matrix factorization algorithm, and a matrix factorization consists of decomposing a matrix into the product of two or more matrices. Intuitively, we can write: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}\phi_{K,V}\]&lt;/span&gt;&lt;/p&gt;
&lt;section id=&#34;the-reduced-dimension-theta-matrix&#34; class=&#34;level2&#34;&gt;
&lt;h2&gt;The reduced dimension, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix&lt;/h2&gt;
&lt;p&gt;In this example, &lt;span class=&#34;math inline&#34;&gt;\(D=596\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(V=1034\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; contains the essential information needed to understand the variation between the observations (the speeches). For instance, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be used to study how Democrats differ from Republicans regarding the relative importance of the themes they cover in their speeches. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be seen as a regular spreadsheet of data, as shown below. For an extended exposition of LDA, see &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;this&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta_matrix &amp;lt;- posterior(SoSA_topics)$topics # Extract the theta matrix
theta_matrix &amp;lt;- round(as.data.frame(theta_matrix), digits = 3)
names(theta_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2, sep = &amp;quot;&amp;quot;) # Name the columns
head(theta_matrix, n = 10) # Print out the first 10 observations&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                       Topic.1 Topic.2
## Alabama_2001_D_1.txt    0.274   0.726
## Alabama_2002_D_2.txt    0.377   0.623
## Alabama_2003_R_3.txt    0.767   0.233
## Alabama_2004_R_4.txt    0.613   0.387
## Alabama_2005_R_5.txt    0.484   0.516
## Alabama_2006_R_6.txt    0.513   0.487
## Alabama_2007_R_7.txt    0.424   0.576
## Alabama_2008_R_8.txt    0.481   0.519
## Alabama_2009_R_9.txt    0.516   0.484
## Alabama_2010_R_10.txt   0.583   0.417&lt;/code&gt;&lt;/pre&gt;
&lt;/section&gt;
&lt;section id=&#34;how-do-we-know-which-themes-are-covered&#34; class=&#34;level2&#34;&gt;
&lt;h2&gt;How do we know which themes are covered?&lt;/h2&gt;
&lt;p&gt;Well, here we imposed the number of themes by setting &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. To identify the themes, we use the matrix &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;, which presents the relative importance of each word for each theme (or topic).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;phi_matrix &amp;lt;- posterior(SoSA_topics)$terms # Extract the phi matrix
phi_matrix &amp;lt;- round(phi_matrix, 3) # Round the numbers to 3 decimals
phi_matrix[, 1:20] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    abil  abus academ acceler accept access accomplish accord account
## 1 0.001 0.001  0.000       0  0.001  0.000      0.000      0   0.002
## 2 0.000 0.001  0.001       0  0.000  0.003      0.001      0   0.001
##   achiev acknowledg across action activ actual addit address adequ
## 1  0.001      0.001  0.001  0.001 0.001  0.001 0.003   0.005     0
## 2  0.002      0.000  0.002  0.001 0.001  0.000 0.001   0.001     0
##   administr adopt
## 1     0.003 0.001
## 2     0.000 0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might be more helpful to transpose the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix so that, by sorting each topic in decreasing order of the words’ relative weights, we can identify the few most important (in terms of weight) words for a given topic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;T_phi_matrix &amp;lt;- as.data.frame(t(phi_matrix))
names(T_phi_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2)
T_phi_matrix[1:20, ] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic. 1 Topic. 2
## abil          0.001    0.000
## abus          0.001    0.001
## academ        0.000    0.001
## acceler       0.000    0.000
## accept        0.001    0.000
## access        0.000    0.003
## accomplish    0.000    0.001
## accord        0.000    0.000
## account       0.002    0.001
## achiev        0.001    0.002
## acknowledg    0.001    0.000
## across        0.001    0.002
## action        0.001    0.001
## activ         0.001    0.001
## actual        0.001    0.000
## addit         0.003    0.001
## address       0.005    0.001
## adequ         0.000    0.000
## administr     0.003    0.000
## adopt         0.001    0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;terms()&lt;/code&gt; function of the &lt;code&gt;topicmodels&lt;/code&gt; package returns a convenient version of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix that replaces the word weights with the words themselves, after sorting each row of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;terms_matrix &amp;lt;- terms(SoSA_topics, 30) # Extract the first 30 most important words for each topic
terms_matrix[1:15, ] # Print out the first 15 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1   Topic 2   
##  [1,] &amp;quot;budget&amp;quot;  &amp;quot;school&amp;quot;  
##  [2,] &amp;quot;fund&amp;quot;    &amp;quot;work&amp;quot;    
##  [3,] &amp;quot;govern&amp;quot;  &amp;quot;educ&amp;quot;    
##  [4,] &amp;quot;peopl&amp;quot;   &amp;quot;help&amp;quot;    
##  [5,] &amp;quot;million&amp;quot; &amp;quot;children&amp;quot;
##  [6,] &amp;quot;work&amp;quot;    &amp;quot;make&amp;quot;    
##  [7,] &amp;quot;make&amp;quot;    &amp;quot;famili&amp;quot;  
##  [8,] &amp;quot;public&amp;quot;  &amp;quot;nation&amp;quot;  
##  [9,] &amp;quot;propos&amp;quot;  &amp;quot;busi&amp;quot;    
## [10,] &amp;quot;servic&amp;quot;  &amp;quot;creat&amp;quot;   
## [11,] &amp;quot;dollar&amp;quot;  &amp;quot;health&amp;quot;  
## [12,] &amp;quot;know&amp;quot;    &amp;quot;student&amp;quot; 
## [13,] &amp;quot;spend&amp;quot;   &amp;quot;invest&amp;quot;  
## [14,] &amp;quot;increas&amp;quot; &amp;quot;teacher&amp;quot; 
## [15,] &amp;quot;program&amp;quot; &amp;quot;care&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By exploring the most important words for each topic, it seems reasonable to infer that Topic.1 is about “money”, the budget; and Topic.2 is mostly about education.&lt;/p&gt;
&lt;p&gt;In sum, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; provides the essential information needed to understand variations or differences between observations; &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is used to infer the meaning of each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id=&#34;using-theta-for-statistical-analysis&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Using &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; for statistical analysis&lt;/h1&gt;
&lt;p&gt;What use can we make of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;? Quite a lot!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; alone, or combined with other control variables, &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;, can be used for regular statistical analysis. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; has been used for economic analyses. &lt;span class=&#34;citation&#34; data-cites=&#34;Brown2016&#34;&gt;(Brown, Crowley, and Elliott 2016)&lt;/span&gt; applied LDA to assess whether the thematic content of financial statement disclosures is informative in predicting intentional misreporting. &lt;span class=&#34;citation&#34; data-cites=&#34;Hansen2016&#34;&gt;(Hansen and McMahon 2016)&lt;/span&gt; use LDA in a Factor Augmented Vector Autoregressive modeling framework. I have a working paper exploring the relationship between US governors’ commitments to their economic agenda as stated in their public statements and the expansion of business establishments in their states &lt;span class=&#34;citation&#34; data-cites=&#34;Bikienga2017&#34;&gt;(Bikienga 2017)&lt;/span&gt;. For a survey of the use of LDA and other text analytics tools in economics, see &lt;span class=&#34;citation&#34; data-cites=&#34;Gentzkow2017&#34;&gt;(Gentzkow, Kelly, and Taddy 2017)&lt;/span&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;illustration-of-the-use-of-theta&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Illustration of the use of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;&lt;/h1&gt;
&lt;p&gt;Is there any difference between Democrats and Republicans based on the themes covered in their speeches? To answer this question, we can compute the mean values of the topics by party line. Note that D, R, or I is appended to the rownames of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix shown above; they stand for Democrat, Republican, and Independent.&lt;/p&gt;
&lt;p&gt;Here, I am using the rownames to construct additional variables (&lt;code&gt;state&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;, and &lt;code&gt;party&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)
state_vars &amp;lt;- row.names(theta_matrix) %&amp;gt;% 
  str_split(pattern = &amp;quot;_&amp;quot;) %&amp;gt;% as.data.frame() %&amp;gt;% t()
state_vars &amp;lt;- state_vars[, -4]
state_vars &amp;lt;- data.frame(state_vars)
names(state_vars) &amp;lt;- c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;party&amp;quot;)
df &amp;lt;- data.frame(theta_matrix, state_vars)
n_obs &amp;lt;- sample(1:596, size = 10)
sample_obs &amp;lt;- df[n_obs,]
sample_obs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                            Topic.1 Topic.2       state year party
## Florida_2009_R_94.txt        0.381   0.619     Florida 2009     R
## Kansas_2009_D_171.txt        0.422   0.578      Kansas 2009     D
## Maryland_2003_R_204.txt      0.435   0.565    Maryland 2003     R
## Illinois_2010_D_139.txt      0.579   0.421    Illinois 2010     D
## SouthDakota_2007_R_405.txt   0.378   0.622 SouthDakota 2007     R
## Tennessee_2002_R_411.txt     0.399   0.601   Tennessee 2002     R
## Florida_2004_R_89.txt        0.217   0.783     Florida 2004     R
## RhodeIsland_2002_R_534.txt   0.375   0.625 RhodeIsland 2002     R
## Alabama_2003_R_3.txt         0.767   0.233     Alabama 2003     R
## Minnesota_2008_R_241.txt     0.387   0.613   Minnesota 2008     R&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compute the topics’ means by party line.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(tidyr)
df_by_party &amp;lt;- df %&amp;gt;%
  group_by(party) %&amp;gt;%
summarise(Topic.1 = mean(Topic.1), Topic.2 = mean(Topic.2)) %&amp;gt;%
  gather(Topic, Topic_proportion, Topic.1:Topic.2) %&amp;gt;%
  mutate(Topic_proportion = round(100*Topic_proportion, 0)) %&amp;gt;%
  mutate(pos = c(rep(75, 3), rep(25, 3)))
df_by_party&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 4
##   party Topic   Topic_proportion   pos
##   &amp;lt;fct&amp;gt; &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 D     Topic.1              46.   75.
## 2 I     Topic.1              62.   75.
## 3 R     Topic.1              51.   75.
## 4 D     Topic.2              54.   25.
## 5 I     Topic.2              38.   25.
## 6 R     Topic.2              49.   25.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Democrats seem to talk more about education (Topic.2) than Republicans. On average, about 54% of their speeches refer to the education theme, against 49% for Republicans. Conversely, Republicans tend to talk more about budgetary issues than Democrats (51% for Republicans vs. 46% for Democrats).&lt;/p&gt;
&lt;p&gt;Clearly, these differences are not huge, and we cannot put too much stock in them. The goal here is to illustrate how one may use the topic distributions, without going into the intricacies of statistical significance.&lt;/p&gt;
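&lt;p&gt;For readers who do want a quick check, here is a minimal sketch of a two-sample t-test on the education topic, comparing Democrats and Republicans only. It ignores the panel structure of the data (repeated speeches per state), so it should be read as a rough diagnostic at best.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_DR &amp;lt;- subset(df, party %in% c(&amp;quot;D&amp;quot;, &amp;quot;R&amp;quot;)) # drop the Independent speeches
df_DR$party &amp;lt;- droplevels(df_DR$party)      # keep only the two remaining levels
t.test(Topic.2 ~ party, data = df_DR)       # education theme, by party&lt;/code&gt;&lt;/pre&gt;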
&lt;p&gt;The above table can be visualized with the help of a stacked bar plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggthemes)
library(extrafont)
#library(plyr)
#library(scales)
fill &amp;lt;- c(&amp;quot;#add8e6&amp;quot;, &amp;quot;#b87333&amp;quot;)
p_party &amp;lt;- ggplot() +
  geom_bar(aes(y = Topic_proportion, x = party, fill = Topic), 
           data = df_by_party, stat=&amp;quot;identity&amp;quot;) +
  geom_text(data=df_by_party, aes(x = party, y = pos, label = paste0(Topic_proportion,&amp;quot;%&amp;quot;)),
            colour=&amp;quot;black&amp;quot;, family=&amp;quot;Tahoma&amp;quot;, size=4) +
  theme(legend.position=&amp;quot;bottom&amp;quot;, legend.direction=&amp;quot;horizontal&amp;quot;,
        legend.title = element_blank()) +
  labs(x=&amp;quot;Political Party&amp;quot;, y=&amp;quot;Percentage&amp;quot;) +
  ggtitle(&amp;quot;Average Proportion of Topic Covered By Party (%)&amp;quot;) +
  scale_fill_manual(values=fill) +
  theme(axis.line = element_line(size=1, colour = &amp;quot;black&amp;quot;),
        panel.grid.major = element_line(colour = &amp;quot;#d3d3d3&amp;quot;), panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  theme(plot.title = element_text(size = 14, family = &amp;quot;Tahoma&amp;quot;, face = &amp;quot;bold&amp;quot;),
        text=element_text(family=&amp;quot;Tahoma&amp;quot;),
        axis.text.x=element_text(colour=&amp;quot;black&amp;quot;, size = 10),
        axis.text.y=element_text(colour=&amp;quot;black&amp;quot;, size = 10))
p_party&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-11-11-topic-modeling-an-application_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;should-we-trust-the-results&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Should we trust the results?&lt;/h1&gt;
&lt;p&gt;Yes! We should. A mental block I faced when I started exploring topic modeling was trusting the results. If your program is like mine, latent variable models are not covered in your econometrics classes, even though they are widely used in the economics literature. In Macroeconomics, they are termed Factor Augmented Vector Autoregressive models. In Development Economics, they are used to construct indices &lt;span class=&#34;citation&#34; data-cites=&#34;Berenger2007 Tabellini2010&#34;&gt;(Bérenger and Verdier-Chouchane 2007; Tabellini 2010)&lt;/span&gt;. Factor model approaches are also used as instruments &lt;span class=&#34;citation&#34; data-cites=&#34;Bai2010&#34;&gt;(Bai and Ng 2010)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;But LDA is just another factor model algorithm. It is closely related to principal component analysis (PCA). In the future, I will present the idea of factor models, and why they are “reliable”.&lt;/p&gt;
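&lt;p&gt;For the curious, here is a hedged sketch of that comparison: running PCA on the same word counts matrix (assuming &lt;code&gt;SoSA_data_df&lt;/code&gt; can be coerced to a dense matrix). The first few principal components play a role loosely analogous to the topic proportions, though without LDA’s probabilistic interpretation or non-negativity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca &amp;lt;- prcomp(as.matrix(SoSA_data_df)) # PCA on the 596 x 1034 count matrix
head(round(pca$x[, 1:2], 3))           # first two components, one row per speech&lt;/code&gt;&lt;/pre&gt;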
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In sum, topic modeling in general, and LDA in particular, is a dimension reduction method. It consists of collapsing a matrix of word counts into a reduced matrix of topic distributions. This illustration provides a sense of its usefulness for statistical analysis.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;references&#34; class=&#34;level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bai2010&#34;&gt;
&lt;p&gt;Bai, Jushan, and Serena Ng. 2010. “Instrumental Variable Estimation in a Data Rich Environment.” &lt;em&gt;Econometric Theory&lt;/em&gt; 26 (6). Cambridge University Press: 1577–1606.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Berenger2007&#34;&gt;
&lt;p&gt;Bérenger, Valérie, and Audrey Verdier-Chouchane. 2007. “Multidimensional Measures of Well-Being: Standard of Living and Quality of Life Across Countries.” &lt;em&gt;World Development&lt;/em&gt; 35 (7). Elsevier: 1259–76.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Bikienga2017&#34;&gt;
&lt;p&gt;Bikienga, Salfo. 2017. “The Governor as the Entrepreneur in Chief: An Exploratory Analysis.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;J. Mach. Learn. Res.&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Brown2016&#34;&gt;
&lt;p&gt;Brown, Nerissa C, Richard M Crowley, and W Brooke Elliott. 2016. “What Are You Saying? Using Topic to Detect Financial Misreporting.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Gentzkow2017&#34;&gt;
&lt;p&gt;Gentzkow, Matthew, Bryan T Kelly, and Matt Taddy. 2017. “Text as Data.” National Bureau of Economic Research.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Hansen2016&#34;&gt;
&lt;p&gt;Hansen, Stephen, and Michael McMahon. 2016. “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” &lt;em&gt;Journal of International Economics&lt;/em&gt; 99. Elsevier: S114–S133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Tabellini2010&#34;&gt;
&lt;p&gt;Tabellini, Guido. 2010. “Culture and Institutions: Economic Development in the Regions of Europe.” &lt;em&gt;Journal of the European Economic Association&lt;/em&gt; 8 (4). Oxford University Press: 677–716.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
</description>
    </item>
    
    <item>
      <title>Topic Modeling: An Application</title>
      <link>/post/topic-modeling-an-application/</link>
      <pubDate>Sat, 11 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-an-application/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;My work involves the use and development of topic modeling algorithms. A surprising challenge I have had is communicating the output of topic modeling algorithms to people not familiar with text analytics. Here is my ten-cent explanation of the LDA output for my econ friends.&lt;/p&gt;
&lt;p&gt;The use of text data for &lt;a href=&#34;http://review.chicagobooth.edu/magazine/spring-2015/why-words-are-the-new-numbers&#34; target=&#34;_blank&#34;&gt;economic analysis&lt;/a&gt; is gaining traction. One popular analytical tool is Latent Dirichlet Allocation (LDA), also called topic modeling &lt;span class=&#34;citation&#34;&gt;(Blei, Ng, and Jordan 2003)&lt;/span&gt;. Succinctly put, topic modeling consists of collapsing a matrix (i.e., a spreadsheet) of word counts into a reduced matrix of topic proportions within documents. For instance, assume we have a collection of 500 documents (called a corpus) with 2000 unique words across them; this corpus can be represented as a dataset of 500 observations and 2000 variables (each word being a variable). Each cell in the matrix is the count of a word in a document, so the matrix is just a regular spreadsheet of data. Clearly, it is almost impossible to draw any insight from that many variables. LDA allows us to collapse the high dimensional dataset into a lower dimension, say a dimension of 10. With 10 variables, there is hope that some insight can be drawn from the data. Following is a demonstration of LDA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example Data&lt;/h1&gt;
&lt;p&gt;Let’s consider a dataset of U.S. governors’ State of the State Addresses (SoSA). In most states, the governor gives a speech, generally in January, in which he/she lays out his/her priorities for the next fiscal year. Part of the goal of the speech is to explain (or justify) the proposed budget, and hopefully to convince the state’s stakeholders to support it. A budget proposal usually involves a reallocation of state resources, which implies cuts and increases across different lines of the state budget. I collected 596 speeches from governors of the 50 states, spanning 2001 to 2013.&lt;/p&gt;
&lt;p&gt;It is customary in text analytics to delete words that we believe are not “discriminative”. For instance, link words such as “the”, “and”, “she”, etc. will not distinguish a Democrat from a Republican. We call this process pre-processing the data: cleaning the data by removing elements of the texts that we believe are not useful for our analysis.&lt;/p&gt;
&lt;p&gt;After pre-processing the data, I am left with a dataset of 596 observations and 1034 words (or variables). You can take a look at the pre-processed data &lt;a href=&#34;http://rpubs.com/sbikienga/334137&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;, or you can download it &lt;a href=&#34;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;. Stemming, that is, stripping words down to their roots, is often done to avoid counting related words separately. For example, education, educational, and educate are all stemmed to educ. A minimal sketch of such a pipeline follows.&lt;/p&gt;
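&lt;p&gt;To make the pre-processing step concrete, here is a minimal sketch of a typical pipeline using the &lt;code&gt;tm&lt;/code&gt; and &lt;code&gt;SnowballC&lt;/code&gt; packages (the two example sentences are made up, and the actual pre-processing of the SoSA corpus may have differed in its details):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(c(&amp;quot;tm&amp;quot;, &amp;quot;SnowballC&amp;quot;)) # if needed
library(tm)        # text pre-processing utilities
library(SnowballC) # Porter stemmer used by stemDocument()
docs &amp;lt;- c(&amp;quot;Education and educational programs educate our children.&amp;quot;,
          &amp;quot;The budget funds our schools and teachers.&amp;quot;)
corpus &amp;lt;- VCorpus(VectorSource(docs))
corpus &amp;lt;- tm_map(corpus, content_transformer(tolower)) # lower-case
corpus &amp;lt;- tm_map(corpus, removePunctuation)            # drop punctuation
corpus &amp;lt;- tm_map(corpus, removeWords, stopwords(&amp;quot;en&amp;quot;)) # drop link words
corpus &amp;lt;- tm_map(corpus, stemDocument)                 # educ, school, ...
dtm &amp;lt;- DocumentTermMatrix(corpus) # the matrix of word counts, W
inspect(dtm)&lt;/code&gt;&lt;/pre&gt;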
&lt;/div&gt;
&lt;div id=&#34;example-application-of-lda&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example application of LDA&lt;/h1&gt;
&lt;p&gt;The goal when using LDA is primarily to reduce the dimension of a counts dataset. The hope is that the reduced dimension preserves the essential information contained in the original dataset. Interestingly, the reduced dimension is often more appropriate for statistical analysis, as it “solves” the overfitting problem associated with high dimensional data. Generally, the overfitting problem arises in situations where &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is not big enough to provide reliable estimates of the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; variables’ parameters.&lt;/p&gt;
&lt;p&gt;There are several packages in R to implement the LDA model (&lt;code&gt;lda&lt;/code&gt;, &lt;code&gt;mallet&lt;/code&gt;, and &lt;code&gt;topicmodels&lt;/code&gt;). Here I will use the &lt;code&gt;topicmodels&lt;/code&gt; package as an example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;topicmodels&amp;quot;) # You should run this code once if you don&amp;#39;t have topicmodels installed
library(topicmodels) # Load the topicmodels package
url &amp;lt;- url(&amp;quot;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&amp;quot;)
load(url) # Load the data from the url provided
SoSA_topics &amp;lt;- LDA(SoSA_data_df, # The matrix of words counts
                   k = 2, # The number of topics to construct
                   method = &amp;quot;Gibbs&amp;quot;, # Estimation method
                   control = list(iter = 3000, # Number of iterations
                                   burnin = 1000, # Throw out the first 1000 draws
                                   seed = 123)) # For reproducible results&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that LDA is a matrix factorization algorithm, and a matrix factorization consists of decomposing a matrix into the product of two or more matrices. Intuitively, we can write: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}\phi_{K,V}\]&lt;/span&gt;&lt;/p&gt;
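&lt;p&gt;One way to see the factorization at work is to multiply the two estimated matrices back together and compare the product with the observed word shares. A quick sketch, using the &lt;code&gt;posterior()&lt;/code&gt; extractor introduced just below (the match is approximate by construction):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta &amp;lt;- posterior(SoSA_topics)$topics # D x K matrix
phi &amp;lt;- posterior(SoSA_topics)$terms    # K x V matrix
W_hat &amp;lt;- theta %*% phi                 # D x V; each row sums to 1
W_obs &amp;lt;- as.matrix(SoSA_data_df) / rowSums(as.matrix(SoSA_data_df))
round(W_hat[1, 1:3], 4) # reconstructed shares of 3 words, document 1
round(W_obs[1, 1:3], 4) # observed shares, for comparison&lt;/code&gt;&lt;/pre&gt;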
&lt;div id=&#34;the-reduced-dimension-theta-matrix&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The reduced dimension, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix&lt;/h2&gt;
&lt;p&gt;In this example, &lt;span class=&#34;math inline&#34;&gt;\(D=596\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(V=1034\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; contains the essential information needed to understand the variation across speeches. For instance, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be used to study how Democrats differ from Republicans in the relative importance of the themes they cover in their speeches. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be viewed as a regular spreadsheet of data, as shown below. For an extended exposition of LDA, see &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;this&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta_matrix &amp;lt;- posterior(SoSA_topics)$topics # Extract the theta matrix
theta_matrix &amp;lt;- round(as.data.frame(theta_matrix), digits = 3)
names(theta_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2, sep = &amp;quot;&amp;quot;) # Name the columns
head(theta_matrix, n = 10) # Print out the first 10 observations&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                       Topic.1 Topic.2
## Alabama_2001_D_1.txt    0.274   0.726
## Alabama_2002_D_2.txt    0.377   0.623
## Alabama_2003_R_3.txt    0.767   0.233
## Alabama_2004_R_4.txt    0.613   0.387
## Alabama_2005_R_5.txt    0.484   0.516
## Alabama_2006_R_6.txt    0.513   0.487
## Alabama_2007_R_7.txt    0.424   0.576
## Alabama_2008_R_8.txt    0.481   0.519
## Alabama_2009_R_9.txt    0.516   0.484
## Alabama_2010_R_10.txt   0.583   0.417&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;how-do-we-know-which-themes-are-covered&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How do we know which themes are covered?&lt;/h2&gt;
&lt;p&gt;Well, here we imposed the number of themes by setting &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. To identify the themes, we use the matrix &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;, which presents the relative importance of each word for each theme (or topic).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;phi_matrix &amp;lt;- posterior(SoSA_topics)$terms # Extract the phi matrix
phi_matrix &amp;lt;- round(phi_matrix, 3) # Round the numbers to 3 decimals
phi_matrix[, 1:20] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    abil  abus academ acceler accept access accomplish accord account
## 1 0.001 0.001  0.000       0  0.001  0.000      0.000      0   0.002
## 2 0.000 0.001  0.001       0  0.000  0.003      0.001      0   0.001
##   achiev acknowledg across action activ actual addit address adequ
## 1  0.001      0.001  0.001  0.001 0.001  0.001 0.003   0.005     0
## 2  0.002      0.000  0.002  0.001 0.001  0.000 0.001   0.001     0
##   administr adopt
## 1     0.003 0.001
## 2     0.000 0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might be more helpful to transpose &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; so that, by sorting each topic in decreasing order of the words’ relative weights, we can identify the few most important (heaviest) words for a given topic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;T_phi_matrix &amp;lt;- as.data.frame(t(phi_matrix))
names(T_phi_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2)
T_phi_matrix[1:20, ] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic. 1 Topic. 2
## abil          0.001    0.000
## abus          0.001    0.001
## academ        0.000    0.001
## acceler       0.000    0.000
## accept        0.001    0.000
## access        0.000    0.003
## accomplish    0.000    0.001
## accord        0.000    0.000
## account       0.002    0.001
## achiev        0.001    0.002
## acknowledg    0.001    0.000
## across        0.001    0.002
## action        0.001    0.001
## activ         0.001    0.001
## actual        0.001    0.000
## addit         0.003    0.001
## address       0.005    0.001
## adequ         0.000    0.000
## administr     0.003    0.000
## adopt         0.001    0.000&lt;/code&gt;&lt;/pre&gt;
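&lt;p&gt;The sorting step itself is a one-liner with &lt;code&gt;order()&lt;/code&gt;; for instance, a quick sketch for Topic 1 (using the column names created above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_topic1 &amp;lt;- T_phi_matrix[order(T_phi_matrix$`Topic. 1`, decreasing = TRUE), ]
head(top_topic1, 10) # the 10 heaviest words for Topic 1&lt;/code&gt;&lt;/pre&gt;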
&lt;p&gt;The &lt;code&gt;terms()&lt;/code&gt; function of the &lt;code&gt;topicmodels&lt;/code&gt; package returns a convenient version of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix that replaces the word weights with the words themselves, after sorting each row of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;terms_matrix &amp;lt;- terms(SoSA_topics, 30) # Extract the first 30 most important words for each topic
terms_matrix[1:15, ] # Print out the first 15 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1   Topic 2   
##  [1,] &amp;quot;budget&amp;quot;  &amp;quot;school&amp;quot;  
##  [2,] &amp;quot;fund&amp;quot;    &amp;quot;work&amp;quot;    
##  [3,] &amp;quot;govern&amp;quot;  &amp;quot;educ&amp;quot;    
##  [4,] &amp;quot;peopl&amp;quot;   &amp;quot;help&amp;quot;    
##  [5,] &amp;quot;million&amp;quot; &amp;quot;children&amp;quot;
##  [6,] &amp;quot;work&amp;quot;    &amp;quot;make&amp;quot;    
##  [7,] &amp;quot;make&amp;quot;    &amp;quot;famili&amp;quot;  
##  [8,] &amp;quot;public&amp;quot;  &amp;quot;nation&amp;quot;  
##  [9,] &amp;quot;propos&amp;quot;  &amp;quot;busi&amp;quot;    
## [10,] &amp;quot;servic&amp;quot;  &amp;quot;creat&amp;quot;   
## [11,] &amp;quot;dollar&amp;quot;  &amp;quot;health&amp;quot;  
## [12,] &amp;quot;know&amp;quot;    &amp;quot;student&amp;quot; 
## [13,] &amp;quot;spend&amp;quot;   &amp;quot;invest&amp;quot;  
## [14,] &amp;quot;increas&amp;quot; &amp;quot;teacher&amp;quot; 
## [15,] &amp;quot;program&amp;quot; &amp;quot;care&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Exploring the most important words for each topic, it seems reasonable to infer that Topic.1 is about “money” (the budget), and that Topic.2 is mostly about education.&lt;/p&gt;
&lt;p&gt;In sum, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; provides the essential information needed to understand variations or differences between observations; &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is used to infer the meaning of each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-theta-for-statistical-analysis&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; for statistical analysis&lt;/h1&gt;
&lt;p&gt;What use can we make of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;? Quite a lot!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; alone, or combined with other control variables, &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;, can be used for regular statistical analysis, and it has already been used for economic analyses. &lt;span class=&#34;citation&#34;&gt;(Brown, Crowley, and Elliott 2016)&lt;/span&gt; applied LDA to assess whether the thematic content of financial statement disclosures is informative in predicting intentional misreporting. &lt;span class=&#34;citation&#34;&gt;(Hansen and McMahon 2016)&lt;/span&gt; used LDA in a Factor Augmented Vector Autoregressive modeling framework. I have a working paper exploring the relationship between US governors’ commitment to their economic agenda, as stated in their public statements, and the expansion of business establishments in their states &lt;span class=&#34;citation&#34;&gt;(Bikienga 2017)&lt;/span&gt;. For a survey of the use of LDA and other text analytics tools in economics, see &lt;span class=&#34;citation&#34;&gt;(Gentzkow, Kelly, and Taddy 2017)&lt;/span&gt;.&lt;/p&gt;
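&lt;p&gt;As a purely illustrative sketch of the idea (the outcome variable below is simulated, not real data), &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be fed into an ordinary regression:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
# Hypothetical outcome, e.g. state employment growth (simulated here)
growth &amp;lt;- 1 + 2 * theta_matrix$Topic.2 + rnorm(nrow(theta_matrix), sd = 0.5)
# Topic.1 + Topic.2 = 1 by construction, so include only one of them
fit &amp;lt;- lm(growth ~ Topic.2, data = theta_matrix)
summary(fit)$coefficients&lt;/code&gt;&lt;/pre&gt;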
&lt;/div&gt;
&lt;div id=&#34;illustration-of-the-use-of-theta&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Illustration of the use of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;&lt;/h1&gt;
&lt;p&gt;Is there any difference between Democrats and Republicans in the themes covered in their speeches? To answer this question, we can compute the mean values of the topics by party. Note that D, R, or I is appended to the rownames of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix shown above; they stand for Democrat, Republican, and Independent.&lt;/p&gt;
&lt;p&gt;Here, I am using the rownames to construct additional variables (&lt;code&gt;state&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;, and &lt;code&gt;party&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)
state_vars &amp;lt;- row.names(theta_matrix) %&amp;gt;% 
  str_split(pattern = &amp;quot;_&amp;quot;) %&amp;gt;% as.data.frame() %&amp;gt;% t()
state_vars &amp;lt;- state_vars[, -4]
state_vars &amp;lt;- data.frame(state_vars)
names(state_vars) &amp;lt;- c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;party&amp;quot;)
df &amp;lt;- data.frame(theta_matrix, state_vars)
n_obs &amp;lt;- sample(1:596, size = 10)
sample_obs &amp;lt;- df[n_obs,]
sample_obs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                             Topic.1 Topic.2        state year party
## Idaho_2008_R_126.txt          0.648   0.352        Idaho 2008     R
## NewJersey_2009_D_307.txt      0.477   0.523    NewJersey 2009     D
## NewHampshire_2007_D_295.txt   0.277   0.723 NewHampshire 2007     D
## Alabama_2005_R_5.txt          0.484   0.516      Alabama 2005     R
## Tennessee_2013_R_588.txt      0.669   0.331    Tennessee 2013     R
## Wyoming_2010_D_499.txt        0.795   0.205      Wyoming 2010     D
## Washington_2002_D_460.txt     0.446   0.554   Washington 2002     D
## Maine_2005_D_195.txt          0.344   0.656        Maine 2005     D
## Virginia_2011_R_458.txt       0.570   0.430     Virginia 2011     R
## California_2011_D_52.txt      0.679   0.321   California 2011     D&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, compute the topics’ means by party:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(tidyr)
df_by_party &amp;lt;- df %&amp;gt;%
  group_by(party) %&amp;gt;%
summarise(Topic.1 = mean(Topic.1), Topic.2 = mean(Topic.2)) %&amp;gt;%
  gather(Topic, Topic_proportion, Topic.1:Topic.2) %&amp;gt;%
  mutate(Topic_proportion = round(100*Topic_proportion, 0)) %&amp;gt;%
  mutate(pos = c(rep(75, 3), rep(25, 3)))
df_by_party&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 4
##   party Topic   Topic_proportion   pos
##   &amp;lt;fct&amp;gt; &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 D     Topic.1               46    75
## 2 I     Topic.1               62    75
## 3 R     Topic.1               51    75
## 4 D     Topic.2               54    25
## 5 I     Topic.2               38    25
## 6 R     Topic.2               49    25&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Democrats seem to talk more about education (Topic.2) than Republicans: on average, about 54% of their speeches refer to the education theme, against 49% for Republicans. Conversely, Republicans tend to talk more about budgetary issues than Democrats (51% for Republicans vs. 46% for Democrats).&lt;/p&gt;
&lt;p&gt;Clearly, these differences are not huge, and we should not put too much stock in them. The goal here is to illustrate how one may use the topic distributions, without going into the intricacies of statistical significance.&lt;/p&gt;
&lt;p&gt;The above table can be visualized with the help of a stacked bar plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggthemes)
library(extrafont)
#library(plyr)
#library(scales)
fill &amp;lt;- c(&amp;quot;#add8e6&amp;quot;, &amp;quot;#b87333&amp;quot;)
p_party &amp;lt;- ggplot() +
  geom_bar(aes(y = Topic_proportion, x = party, fill = Topic), 
           data = df_by_party, stat=&amp;quot;identity&amp;quot;) +
  geom_text(data=df_by_party, aes(x = party, y = pos, label = paste0(Topic_proportion,&amp;quot;%&amp;quot;)),
            colour=&amp;quot;black&amp;quot;, family=&amp;quot;Tahoma&amp;quot;, size=4) +
  theme(legend.position=&amp;quot;bottom&amp;quot;, legend.direction=&amp;quot;horizontal&amp;quot;,
        legend.title = element_blank()) +
  labs(x=&amp;quot;Political Party&amp;quot;, y=&amp;quot;Percentage&amp;quot;) +
  ggtitle(&amp;quot;Average Proportion of Topic Covered By Party (%)&amp;quot;) +
  scale_fill_manual(values=fill) +
  theme(axis.line = element_line(size=1, colour = &amp;quot;black&amp;quot;),
        panel.grid.major = element_line(colour = &amp;quot;#d3d3d3&amp;quot;), panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  theme(plot.title = element_text(size = 14, family = &amp;quot;Tahoma&amp;quot;, face = &amp;quot;bold&amp;quot;),
        text=element_text(family=&amp;quot;Tahoma&amp;quot;),
        axis.text.x=element_text(colour=&amp;quot;black&amp;quot;, size = 10),
        axis.text.y=element_text(colour=&amp;quot;black&amp;quot;, size = 10))
p_party&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-11-11-topic-modeling-an-application_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;should-we-trust-the-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Should we trust the results?&lt;/h1&gt;
&lt;p&gt;Yes! We should. A mental block I faced when I started exploring topic modeling was trusting the results. If your program is like mine, latent variable models are not covered in your econometrics classes, even though they are widely used in the economics literature. In Macroeconomics, they are termed Factor Augmented Vector Autoregressive models. In Development Economics, they are used to construct indices &lt;span class=&#34;citation&#34;&gt;(Bérenger and Verdier-Chouchane 2007; Tabellini 2010)&lt;/span&gt;. Factor model approaches are also used as instruments &lt;span class=&#34;citation&#34;&gt;(Bai and Ng 2010)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;But, LDA is just another factor model algorithm. It is closely related to principal component analysis (PCA). In the future, I will present the idea of factor models, and why they are “reliable”.&lt;/p&gt;
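&lt;p&gt;To get a feel for the analogy, one can run PCA on the same word-counts matrix with base R. A rough sketch, assuming &lt;code&gt;SoSA_data_df&lt;/code&gt; is still loaded (and glossing over the scaling choices a careful analysis would weigh):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca &amp;lt;- prcomp(SoSA_data_df) # PCA on the 596 x 1034 counts matrix
scores &amp;lt;- pca$x[, 1:2] # a 596 x 2 reduced matrix, analogous to theta
head(round(scores, 3))&lt;/code&gt;&lt;/pre&gt;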
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In sum, topic modeling in general, and LDA in particular, is a dimension reduction method. It consists of collapsing a matrix of word counts into a reduced matrix of topic distributions. This illustration provides a sense of its usefulness for statistical analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bai2010&#34;&gt;
&lt;p&gt;Bai, Jushan, and Serena Ng. 2010. “Instrumental Variable Estimation in a Data Rich Environment.” &lt;em&gt;Econometric Theory&lt;/em&gt; 26 (6). Cambridge University Press: 1577–1606.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Berenger2007&#34;&gt;
&lt;p&gt;Bérenger, Valérie, and Audrey Verdier-Chouchane. 2007. “Multidimensional Measures of Well-Being: Standard of Living and Quality of Life Across Countries.” &lt;em&gt;World Development&lt;/em&gt; 35 (7). Elsevier: 1259–76.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Bikienga2017&#34;&gt;
&lt;p&gt;Bikienga, Salfo. 2017. “The Governor as the Entrepreneur in Chief: An Exploratory Analysis.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;J. Mach. Learn. Res.&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Brown2016&#34;&gt;
&lt;p&gt;Brown, Nerissa C, Richard M Crowley, and W Brooke Elliott. 2016. “What Are You Saying? Using Topic to Detect Financial Misreporting.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Gentzkow2017&#34;&gt;
&lt;p&gt;Gentzkow, Matthew, Bryan T Kelly, and Matt Taddy. 2017. “Text as Data.” National Bureau of Economic Research.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Hansen2016&#34;&gt;
&lt;p&gt;Hansen, Stephen, and Michael McMahon. 2016. “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” &lt;em&gt;Journal of International Economics&lt;/em&gt; 99. Elsevier: S114–S133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Tabellini2010&#34;&gt;
&lt;p&gt;Tabellini, Guido. 2010. “Culture and Institutions: Economic Development in the Regions of Europe.” &lt;em&gt;Journal of the European Economic Association&lt;/em&gt; 8 (4). Oxford University Press: 677–716.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Introduction to LDA</title>
      <link>/post/introduction-to-lda/</link>
      <pubDate>Wed, 01 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/introduction-to-lda/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;An important development in text analytics is the invention of the Latent Dirichlet Allocation (LDA) algorithm (also called topic modeling) in 2003. LDA is a non-negative matrix factorization algorithm. A matrix factorization consists of decomposing a matrix into a product of two or more matrices. It turns out that these linear algebra techniques have applications in data analysis, generally referred to as data dimension reduction methods. Examples of matrix factorization methods in statistics include Factor Analysis, Principal Component Analysis, and Latent Dirichlet Allocation. They are all latent variable models, which use observed variables to infer the values of unobserved (or hidden) variables. The basic idea of these methods is to find &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}\)&lt;/span&gt; (two sets of hidden variables) from &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V}\)&lt;/span&gt;, the set of observed variables, such that: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}*\phi_{K,V}\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; is the number of observations, &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt; is the number of variables, and &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; is the number of latent variables. We want &lt;span class=&#34;math inline&#34;&gt;\(K&amp;lt;&amp;lt;V\)&lt;/span&gt;, and “hopefully” we can infer a meaning for each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt; from the corresponding &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; rows of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}\)&lt;/span&gt;. It also turns out that most of the information about the observations (the rows of &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V}\)&lt;/span&gt;) is captured in the reduced matrix &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}\)&lt;/span&gt;, hence the idea of data dimension reduction. A major challenge in data dimension reduction is deciding on the appropriate value of &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;To help fix ideas, let’s assume we have the exam scores of 100 students on the following subjects: Gaelic, English, History, Arithmetic, Algebra, Geometry (this is not a text data example, but it is a good one to illustrate the idea of latent variable models). The dataset is &lt;span class=&#34;math inline&#34;&gt;\(W_{D,V} = W_{100,6}\)&lt;/span&gt;; that is, 100 observations and 6 variables. Let’s assume we want to collapse the &lt;span class=&#34;math inline&#34;&gt;\(V = 6\)&lt;/span&gt; variables into &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt; variables. Let’s further assume that the first new variable may be termed “Humanities”, and the second may be termed “Math” (this is a sensible assumption!). Thus, we want to create a &lt;span class=&#34;math inline&#34;&gt;\(\theta_{100,2}\)&lt;/span&gt; matrix that captures most of the information about the students’ grades on the 6 subjects. With the two variables, humanities and math, we can quickly learn about the students with the help of, for example, a simple scatterplot. The &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix helps us interpret the columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; as humanities and math because (hopefully) one row has big coefficients for Gaelic, English, and History and small coefficients for Arithmetic, Algebra, and Geometry, while the second row has the reverse pattern. I hope this example provides an intuition of what matrix factorization aims to achieve when used for data analysis. The goal is to reduce the dimension of the data, i.e., to reduce the number of variables. The meaning of each new variable is inferred by guessing a name that fits the original variables with the highest coefficients on that new variable. In the future, I will provide a numerical example within the context of Factor Analysis, which is a building block for understanding latent variable models.&lt;/p&gt;
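&lt;p&gt;A tiny simulated version of this example can be run with base R’s &lt;code&gt;factanal()&lt;/code&gt; (the scores below are fabricated for illustration, not real exam data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
n &amp;lt;- 100
hum &amp;lt;- rnorm(n)  # latent &amp;quot;Humanities&amp;quot; ability
math &amp;lt;- rnorm(n) # latent &amp;quot;Math&amp;quot; ability
W &amp;lt;- cbind(Gaelic     = 0.8 * hum  + rnorm(n, sd = 0.4),
           English    = 0.9 * hum  + rnorm(n, sd = 0.4),
           History    = 0.7 * hum  + rnorm(n, sd = 0.4),
           Arithmetic = 0.8 * math + rnorm(n, sd = 0.4),
           Algebra    = 0.9 * math + rnorm(n, sd = 0.4),
           Geometry   = 0.7 * math + rnorm(n, sd = 0.4))
fa &amp;lt;- factanal(W, factors = 2, scores = &amp;quot;regression&amp;quot;)
fa$loadings # the phi-like matrix: weights split cleanly by subject group&lt;/code&gt;&lt;/pre&gt;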
&lt;p&gt;In LDA, the &lt;span class=&#34;math inline&#34;&gt;\(W\)&lt;/span&gt; matrix is a matrix of word counts, the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix is a matrix of topic proportions within each document, and the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix is a matrix of each word’s relative importance for each topic.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;lda-the-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;LDA: the model&lt;/h1&gt;
&lt;p&gt;This section provides a mathematical exposition of topic modeling and presents the generative process from which the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrices are estimated. LDA is a generative model that represents documents as being generated by a random mixture over latent variables called topics &lt;span class=&#34;citation&#34;&gt;(David M. Blei, Ng, and Jordan 2003)&lt;/span&gt;. A topic is defined as a distribution over words. For a given corpus (a collection of documents) of &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; documents, each of length &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt;, the generative process for LDA is defined as follows (a minimal simulation of this process appears after the list):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;For each topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, draw a distribution over words &lt;span class=&#34;math inline&#34;&gt;\(\phi_k \sim Dirichlet(\beta)\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(k \in \{1, 2, ..., K\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;&lt;p&gt;Draw a vector of topic proportions &lt;span class=&#34;math inline&#34;&gt;\(\theta_d \sim Dirichlet(\alpha)\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each word &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&#34;list-style-type: lower-roman&#34;&gt;
&lt;li&gt;&lt;p&gt;Draw a topic assignment &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n} \sim multinomial(\theta_d)\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n} \in \{1, 2, ..., K\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Draw a word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n} \sim multinomial(\phi_{k = z_{d,n}})\)&lt;/span&gt; with &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n} \in \{1, 2, ..., V\}\)&lt;/span&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Note: Only the words &lt;span class=&#34;math inline&#34;&gt;\(w\)&lt;/span&gt; are observed.&lt;/p&gt;
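&lt;p&gt;A minimal simulation of this generative process may help fix ideas (all sizes and hyperparameters below are made up, and the Dirichlet draws are built from &lt;code&gt;rgamma()&lt;/code&gt; to avoid extra packages):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
rdirichlet &amp;lt;- function(n, alpha) { # each row is one Dirichlet draw
  x &amp;lt;- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  x / rowSums(x)
}
K &amp;lt;- 2; V &amp;lt;- 6; D &amp;lt;- 3; N_d &amp;lt;- 20 # topics, vocabulary, documents, length
phi &amp;lt;- rdirichlet(K, rep(0.5, V))   # step 1: topics (distributions over words)
theta &amp;lt;- rdirichlet(D, rep(0.5, K)) # step 2a: topic proportions per document
docs &amp;lt;- lapply(1:D, function(d) {
  z &amp;lt;- sample(1:K, N_d, replace = TRUE, prob = theta[d, ]) # step 2b-i
  sapply(z, function(k) sample(1:V, 1, prob = phi[k, ]))    # step 2b-ii
})
docs[[1]] # word indices of the first simulated document&lt;/code&gt;&lt;/pre&gt;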
&lt;p&gt;The above generative process allows us to construct an explicit closed form expression for the joint likelihood of the observed and hidden variables. Markov Chain Monte Carlo (MCMC), and Variational Bayes methods can then be used to estimate the parameters &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; (See &lt;span class=&#34;citation&#34;&gt;David M. Blei, Ng, and Jordan (2003)&lt;/span&gt;; &lt;span class=&#34;citation&#34;&gt;David M. Blei (2012)&lt;/span&gt; for further exposition of the method). We derive the posterior distribution of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;s and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;s in the next section.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;deriving-the-theta-and-phi-values&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Deriving the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; values&lt;/h1&gt;
&lt;p&gt;A topic &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}\)&lt;/span&gt; is a distribution over V unique words, each having a proportion &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k,v}\)&lt;/span&gt;; i.e &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k,v}\)&lt;/span&gt; is the relative importance of the word v for the definition (or interpretation) of the topic k. It is assumed that:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\phi_{k}\sim Dirichlet_{V}(\beta)\]&lt;/span&gt; That is: &lt;span class=&#34;math display&#34;&gt;\[p(\phi_{k}|\beta)=\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;span class=&#34;math inline&#34;&gt;\(B(\beta)=\frac{\prod_{v=1}^{V}\Gamma(\beta_{v})}{\Gamma(\sum_{v=1}^{V}\beta_{v})}\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\beta=(\beta_{1},...,\beta_{V})\)&lt;/span&gt;. Since we have K independent topics (by assumption), &lt;span class=&#34;math display&#34;&gt;\[p(\phi|\beta)=\prod_{k=1}^{K}\frac{1}{B(\beta)}\prod_{v=1}^{V}\phi_{k,v}^{\beta_{v}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;A document d is a distribution over K topics, each having a proportion &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d,k}\)&lt;/span&gt;, i.e. &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d,k}\)&lt;/span&gt; is the relative importance of the topic k, in the document d. We assume:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\theta_{d}\sim Dirichlet_{K}(\alpha)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(\theta_{d}|\alpha)=\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;And since we have D independent documents (by assumption),&lt;span class=&#34;math display&#34;&gt;\[p(\theta|\alpha)=\prod_{d=1}^{D}\frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_{d,k}^{\alpha_{k}-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It is further assumed that &lt;span class=&#34;math inline&#34;&gt;\(\beta_{v}=\beta\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\alpha_{k}=\alpha\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Let &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt; be the latent topic assignment variable, i.e. the random variable &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; assigns the word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; to the topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; is a vector of zeros with a 1 at the &lt;span class=&#34;math inline&#34;&gt;\(k^{th}\)&lt;/span&gt; position &lt;span class=&#34;math inline&#34;&gt;\((z_{d,n}=[0,0,...1,0,..])\)&lt;/span&gt;. Define &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n,k}=I(z_{d,n}=k)\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; is an indicator function that equals 1 when &lt;span class=&#34;math inline&#34;&gt;\(z_{d,n}\)&lt;/span&gt; is the topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; otherwise. We assume:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[z_{d,n}\sim Multinomial(\theta_{d})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is: &lt;span class=&#34;math display&#34;&gt;\[p(z_{d,n,k}=1|\theta_{d})=\theta_{d,k}, \qquad p(z_{d,n}|\theta_{d})=\prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;A document is assumed to have &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt; independent words, and since we assume D independent documents, we have:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(z|\theta)   =\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{k=1}^{K}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{n=1}^{N_{d}}\theta_{d,k}^{z_{d,n,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\prod_{v=1}^{V}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\theta_{d,k}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt; is the count of the word v in document d.&lt;/p&gt;
&lt;p&gt;The word &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is drawn from the topic’s words distribution &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}\)&lt;/span&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[w_{d,n}|\phi_{k=z_{d,n,k}}\sim Multinomial(\phi_{k=z_{d,n}})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(w_{d,n}=v|\phi_{k=z_{d,n}})=\phi_{k,v}, \qquad p(w_{d,n}|\phi,z_{d,n})=\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is a vector of zeros with a &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; at the &lt;span class=&#34;math inline&#34;&gt;\(v^{th}\)&lt;/span&gt; position. Define &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n,v}=I(w_{d,n}=v)\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(I\)&lt;/span&gt; is an indicator function that equals &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; when &lt;span class=&#34;math inline&#34;&gt;\(w_{d,n}\)&lt;/span&gt; is the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; otherwise.&lt;/p&gt;
&lt;p&gt;There are D independent documents, each having &lt;span class=&#34;math inline&#34;&gt;\(N_{d}\)&lt;/span&gt; independent words, so: &lt;span class=&#34;math display&#34;&gt;\[p(w|\phi)=\prod_{d=1}^{D}\prod_{n=1}^{N_{d}}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{w_{d,n,v}*z_{d,n,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[p(w|\phi)=\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\phi_{k,v}^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The joint distribution of the observed words w and unobserved (or hidden variables) &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(z\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[P(w,z,\theta,\phi|\alpha,\beta)=p(\theta|\alpha)p(z|\theta)p(w|\phi,z)p(\phi|\beta)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The goal is to get the posterior distribution of the unobserved variables: &lt;span class=&#34;math display&#34;&gt;\[p(z,\theta,\phi|w,\alpha,\beta)=\frac{P(w,z,\theta,\phi|\alpha,\beta)}{\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\int\int\sum_{z}P(w,z,\theta,\phi|\alpha,\beta)d\theta d\phi\)&lt;/span&gt; is intractable, so approximation methods are used to approximate the posterior distribution. The seminal LDA paper &lt;span class=&#34;citation&#34;&gt;(David M. Blei, Ng, and Jordan 2003)&lt;/span&gt; uses Mean Field Variational Bayes (an optimization method) to approximate the posterior distribution (see &lt;span class=&#34;citation&#34;&gt;Bishop (2006)&lt;/span&gt;, p. 462, or &lt;span class=&#34;citation&#34;&gt;David M Blei, Kucukelbir, and McAuliffe (2017)&lt;/span&gt; for an exposition of the theory of the variational method). The mean field variational inference uses the following approximation: &lt;span class=&#34;math display&#34;&gt;\[p(z,\theta,\phi|w,\alpha,\beta)\simeq q(z,\theta,\phi)=q(z)q(\theta)q(\phi)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;From &lt;span class=&#34;citation&#34;&gt;Bishop (2006)&lt;/span&gt;, [p. 466], we have: &lt;span class=&#34;math display&#34;&gt;\[q^{*}(z)\propto exp\left\{ E_{\theta,\phi}\left[log(p(z|\theta))+log(p(w|\phi,z))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\theta)\propto exp\left\{ E_{z,\phi}\left[log(p(\theta|\alpha))+log(p(z|\theta))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\phi)\propto exp\left\{ E_{\theta,z}\left[log(p(\phi|\beta))+log(p(w|\phi,z))\right]\right\}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Using the expressions above, we have:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[log(q^{*}(z)) \propto E_{\theta,\phi}\left[\sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(log(\theta_{d,k})+log(\phi_{k,v})\right)\right]\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[\propto   \sum_{d=1}^{D}\sum_{v=1}^{V}\sum_{k=1}^{K}n_{d,v}*z_{d,v,k}\left(E(log(\theta_{d,k}))+E(log(\phi_{k,v}))\right)\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Note that &lt;span class=&#34;math display&#34;&gt;\[x|p\sim Multinomial_{K}(p)\iff log\left(p(x|p)\right)=\sum_{k=1}^{K}x_{k}log(p_{k})\]&lt;/span&gt;, and let’s define &lt;span class=&#34;math inline&#34;&gt;\(log(p_{k})=E(log(\theta_{d,k})+E(log(\phi_{k,v}))\)&lt;/span&gt;, so &lt;span class=&#34;math inline&#34;&gt;\(p_{k}=exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\)&lt;/span&gt;. Thus,&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(z)\propto\prod_{d=1}^{D}\prod_{v=1}^{V}\prod_{k=1}^{K}\left[exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\right]^{n_{d,v}*z_{d,v,k}}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;That is, &lt;span class=&#34;math display&#34;&gt;\[z_{d,v}|w_{d},\theta_{d},\phi_{k}\sim Multinomial_{K}(p_{k})\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;and by the multinomial properties, &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})=p_{k}=exp(E(log(\theta_{d,k}))+E(log(\phi_{k,v})))\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\theta) \propto exp\left\{ E_{z}\left[\sum_{d}\sum_{k}(\alpha-1)log(\theta_{d,k})+\sum_{d}\sum_{k}\sum_{v}n_{d,v}*z_{d,v,k}log(\theta_{d,k})\right]\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d}^{D}\prod_{k=1}^{K}exp\left\{ (\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1)log(\theta_{d,k})\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{d=1}^{D}\prod_{k=1}^{K}\theta_{d,k}^{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thus, the approximate posterior distribution of the topic distribution in a document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt; is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\theta_{d}|w_{d},\alpha\sim Dirichlet_{K}(\tilde{\alpha}_{d})\]&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\alpha}_{d}=\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})\)&lt;/span&gt;. Note that &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\alpha}_{d}\)&lt;/span&gt; is a vector of dimension &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;By the properties of the Dirichlet distribution, the expected value of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{d}|\tilde{\alpha}_{d}\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[E(\theta_{d}|\tilde{\alpha_{d}})=\frac{\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,.})}{\sum_{k=1}^{K}[\alpha+\sum_{v=1}^{V}n_{d,v}E(z_{d,v,k})]}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The numerical estimation of &lt;span class=&#34;math inline&#34;&gt;\(E(\theta_{d}|\tilde{\alpha}_{d})\)&lt;/span&gt; gives the estimates of the topic proportions within each document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\((\hat\theta_{d})\)&lt;/span&gt;. It is worth noting that &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})\)&lt;/span&gt; can be interpreted as the responsibility that topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; takes for explaining the occurrence of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;. Ignoring for a moment the denominator of the equation above, &lt;span class=&#34;math inline&#34;&gt;\(E(\theta_{d,k}|\tilde{\alpha}_{d,k})\)&lt;/span&gt; is similar to a regression equation where &lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt; are the observed word counts in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(E(z_{d,v,k})\)&lt;/span&gt; are the parameter estimates (or weights) of the words. This illustrates that a topic is important in a document when words referring to that topic are frequent &lt;span class=&#34;math inline&#34;&gt;\((n_{d,v})\)&lt;/span&gt; and carry large weights &lt;span class=&#34;math inline&#34;&gt;\((E(z_{d,v,k}))\)&lt;/span&gt;.&lt;/p&gt;
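&lt;p&gt;A toy numerical version of this update may make the formula concrete (the counts and responsibilities below are invented for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One document d with V = 3 words and K = 2 topics
alpha &amp;lt;- 0.1
n_dv &amp;lt;- c(4, 2, 1) # counts of the 3 words in document d
Ez &amp;lt;- matrix(c(0.9, 0.1,   # E(z_{d,v,k}): row v, column k
               0.2, 0.8,
               0.5, 0.5), nrow = 3, byrow = TRUE)
alpha_tilde &amp;lt;- alpha + colSums(n_dv * Ez)   # alpha-tilde_d, length K
theta_hat &amp;lt;- alpha_tilde / sum(alpha_tilde) # E(theta_d | alpha-tilde_d)
theta_hat&lt;/code&gt;&lt;/pre&gt;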
&lt;p&gt;The derivation for &lt;span class=&#34;math inline&#34;&gt;\(q^{*}(\phi)\)&lt;/span&gt; proceeds similarly:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[q^{*}(\phi)   \propto exp\left\{ E_{z}\left[\sum_{k=1}^{K}\sum_{v=1}^{V}(\beta-1)log(\phi_{k,v})+\sum_{d=1}^{D}\sum_{k=1}^{K}\sum_{v=1}^{V}n_{d,v}*z_{d,v,k}log(\phi_{k,v})\right]\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{k=1}^{K}\prod_{v=1}^{V}exp\left\{ (\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1)log(\phi_{k,v})\right\}\]&lt;/span&gt; &lt;span class=&#34;math display&#34;&gt;\[= \prod_{k=1}^{K}\prod_{v=1}^{V}\phi_{k,v}^{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k})-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thus, the approximate posterior distribution of the word distribution in a topic, &lt;span class=&#34;math inline&#34;&gt;\(\hat\phi_{k}\)&lt;/span&gt;, is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}|w,\beta\sim Dirichlet_{V}(\tilde{\beta_{k}})\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\beta_{k}}=\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,.,k})\)&lt;/span&gt;. Note that &lt;span class=&#34;math inline&#34;&gt;\(\tilde{\beta}_{k}\)&lt;/span&gt; is a vector of dimension &lt;span class=&#34;math inline&#34;&gt;\(V\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;And the expected value of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{k}|\tilde{\beta}_{k}\)&lt;/span&gt; is given by:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[  
E(\phi_{k}|\tilde{\beta_{k}})=\frac{\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,.,k})}{\sum_{v=1}^{V}(\beta+\sum_{d=1}^{D}n_{d,v}*E(z_{d,v,k}))} 
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The numerical estimation of &lt;span class=&#34;math inline&#34;&gt;\(E(\phi_{k}|\tilde{\beta}_{k})\)&lt;/span&gt; gives the estimates of the words’ relative importance for each topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\((\hat\phi_{k})\)&lt;/span&gt;. Ignoring the denominator in the equation above, &lt;span class=&#34;math inline&#34;&gt;\(E(\phi_{k,v}|\tilde{\beta_{k,v}})\)&lt;/span&gt; is the weighted sum of the frequencies of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in each of the documents &lt;span class=&#34;math inline&#34;&gt;\((n_{d,v})\)&lt;/span&gt;, the weights being the responsibility topic &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; takes for explaining the occurrence of the word &lt;span class=&#34;math inline&#34;&gt;\(v\)&lt;/span&gt; in document &lt;span class=&#34;math inline&#34;&gt;\(d\)&lt;/span&gt; &lt;span class=&#34;math inline&#34;&gt;\((E(z_{d,v,k}))\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Here, we have derived the posteriors expected values of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;s and &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;s using the words counts &lt;span class=&#34;math inline&#34;&gt;\(n_{d,v}\)&lt;/span&gt;, which is slightly different from &lt;span class=&#34;citation&#34;&gt;David M. Blei, Ng, and Jordan (2003)&lt;/span&gt;. Posterior formulae similar to the current derived solution can be found in &lt;span class=&#34;citation&#34;&gt;Murphy (2012)&lt;/span&gt;, p. 962.&lt;/p&gt;
&lt;p&gt;In sum, the rows of &lt;span class=&#34;math inline&#34;&gt;\(\phi_{K,V}=\left[E(\phi_{k}|\tilde{\beta}_{k})\right]_{K,V}\)&lt;/span&gt; are useful for interpreting (or identifying) the themes, whose relative importance in each document is given by the columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta_{D,K}=\left[E(\theta_{d}|\tilde{\alpha}_{d})\right]_{D,K}\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Practical tools for estimating the topic distributions of a corpus exist (see &lt;span class=&#34;citation&#34;&gt;Grun and Hornik (2011)&lt;/span&gt;; &lt;span class=&#34;citation&#34;&gt;Silge and Robinson (2017, Chap. 6)&lt;/span&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bishop2006&#34;&gt;
&lt;p&gt;Bishop, Christopher M. 2006. &lt;em&gt;Pattern Recognition and Machine Learning&lt;/em&gt;. springer.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2017&#34;&gt;
&lt;p&gt;Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” &lt;em&gt;Journal of the American Statistical Association&lt;/em&gt;, no. just-accepted. Taylor &amp;amp; Francis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2012&#34;&gt;
&lt;p&gt;Blei, David M. 2012. “Probabilistic Topic Models.” &lt;em&gt;Commun. ACM&lt;/em&gt; 55 (4). New York, NY, USA: ACM: 77–84. doi:&lt;a href=&#34;https://doi.org/10.1145/2133806.2133826&#34;&gt;10.1145/2133806.2133826&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;J. Mach. Learn. Res.&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Grun2011&#34;&gt;
&lt;p&gt;Grun, Bettina, and Kurt Hornik. 2011. “Topicmodels: An R Package for Fitting Topic Models.” &lt;em&gt;Journal of Statistical Software, Articles&lt;/em&gt; 40 (13): 1–30. doi:&lt;a href=&#34;https://doi.org/10.18637/jss.v040.i13&#34;&gt;10.18637/jss.v040.i13&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Murphy2012&#34;&gt;
&lt;p&gt;Murphy, Kevin P. 2012. &lt;em&gt;Machine Learning: A Probabilistic Perspective&lt;/em&gt;. MIT press.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Silge2017&#34;&gt;
&lt;p&gt;Silge, J., and D. Robinson. 2017. &lt;em&gt;Text Mining with R: A Tidy Approach&lt;/em&gt;. O’Reilly Media, Incorporated. &lt;a href=&#34;https://books.google.com/books?id=7bQzMQAACAAJ&#34; class=&#34;uri&#34;&gt;https://books.google.com/books?id=7bQzMQAACAAJ&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
