<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Lda on Salfo Bikienga</title>
    <link>/categories/lda/</link>
    <description>Recent content in Lda on Salfo Bikienga</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2017 Salfo Bikienga</copyright>
    <lastBuildDate>Sat, 11 Nov 2017 00:00:00 +0000</lastBuildDate>
    <atom:link href="/categories/lda/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Topic Modeling: An Application</title>
      <link>/post/topic-modeling-an-application/</link>
      <pubDate>Sat, 11 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-an-application/</guid>
      <description>&lt;section id=&#34;introduction&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;My work involves the use and the development of topic modeling algorithms. A surprising challenge I have had is communicating the output of topic modeling algorithms to people not familiar with text analytics. Here is my ten-cent explanation of the LDA output for my econ friends.&lt;/p&gt;
&lt;p&gt;The use of text data for &lt;a href=&#34;http://review.chicagobooth.edu/magazine/spring-2015/why-words-are-the-new-numbers&#34; target=&#34;_blank&#34;&gt;economic analysis&lt;/a&gt; is gaining traction. One popular analytical tool is Latent Dirichlet Allocation (LDA), a widely used topic modeling method &lt;span class=&#34;citation&#34; data-cites=&#34;Blei2003&#34;&gt;(Blei, Ng, and Jordan 2003)&lt;/span&gt;. Succinctly put, topic modeling consists of collapsing a matrix (i.e., a spreadsheet) of word counts into a reduced matrix of topic proportions within documents. For instance, assume we have a collection of 500 documents with a combined vocabulary of 2000 unique words; this collection of documents (called a corpus) can be represented as a dataset of 500 observations and 2000 variables (each word being a variable). Each cell in the matrix represents the count of a word in a document. The matrix is just a regular spreadsheet of data. Clearly, it is almost impossible to draw any insight from that many variables. LDA allows us to collapse the high-dimensional dataset into a lower dimension, say a dimension of 10. With 10 variables, there is hope that some insight can be drawn from the data. What follows is a demonstration of LDA.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;example-data&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Example Data&lt;/h1&gt;
&lt;p&gt;Let’s consider a dataset of U.S. governors’ State of the State Addresses (SoSA). In most states, the governor gives a speech, generally in January, in which he/she lays out his/her priorities for the next fiscal year. Part of the goal of the speech is to explain (or justify) the proposed budget, and hopefully convince the state stakeholders to support the proposed budget. A budget proposal usually involves a reallocation of the state resources, which implies cuts and increases in different lines of the state budget. I collected 596 speeches from governors of the 50 states, spanning from 2001 to 2013.&lt;/p&gt;
&lt;p&gt;It is customary in text analytics to delete words that we believe are not “discriminative”. For instance, function words such as “the”, “and”, “she”, etc. will not distinguish a Democrat from a Republican. We call this step pre-processing the data, that is, cleaning the data by removing elements of the texts that we believe are not useful for our analysis.&lt;/p&gt;
&lt;p&gt;After pre-processing the data, I am left with a dataset of 596 observations and 1034 words (or variables). You can take a look at the pre-processed data &lt;a href=&#34;http://rpubs.com/sbikienga/334137&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;, or you can download it &lt;a href=&#34;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;. Stemming, that is, stripping words down to their roots, is often done to avoid counting related words separately. For example, “education”, “educational”, and “educate” are all stemmed to “educ”.&lt;/p&gt;
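&lt;p&gt;As a rough illustration of stemming, the sketch below runs a Snowball stemmer over the three example words. It assumes the &lt;code&gt;SnowballC&lt;/code&gt; package, which is an assumption made for this illustration and not necessarily the tool used to prepare the SoSA data; &lt;code&gt;words&lt;/code&gt; and &lt;code&gt;stems&lt;/code&gt; are illustrative names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;SnowballC&amp;quot;) # Assumption: SnowballC is used here only for illustration
library(SnowballC)
words &amp;lt;- c(&amp;quot;education&amp;quot;, &amp;quot;educational&amp;quot;, &amp;quot;educate&amp;quot;)
stems &amp;lt;- wordStem(words, language = &amp;quot;english&amp;quot;) # Reduce each word to its stem
table(stems) # Count how many of the words share each stem&lt;/code&gt;&lt;/pre&gt;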
&lt;/section&gt;
&lt;section id=&#34;example-application-of-lda&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Example application of LDA&lt;/h1&gt;
&lt;p&gt;The goal when using LDA is primarily to reduce the dimension of a counts dataset. The hope is that the reduced dimension preserves the essential information contained in the original dataset. Interestingly, the reduced dimension is often more appropriate for statistical analysis, as it “solves” the overfitting problem associated with high dimensional data. Generally, the overfitting problem arises in situations where &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is not big enough to provide reliable estimates of the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; variables’ parameters.&lt;/p&gt;
&lt;p&gt;There are several packages in R to implement the LDA model (&lt;code&gt;lda&lt;/code&gt;, &lt;code&gt;mallet&lt;/code&gt;, and &lt;code&gt;topicmodels&lt;/code&gt;). Here I will use the &lt;code&gt;topicmodels&lt;/code&gt; package as an example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;topicmodels&amp;quot;) # You should run this code once if you don&amp;#39;t have topicmodels installed
library(topicmodels) # Load the topicmodels package
url &amp;lt;- url(&amp;quot;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&amp;quot;)
load(url) # Load the data from the url provided
SoSA_topics &amp;lt;- LDA(SoSA_data_df, # The matrix of words counts
                   k = 2, # The number of topics to construct
                   method = &amp;quot;Gibbs&amp;quot;, # Estimation method
                   control = list(iter = 3000, # Number of iterations
                                  burnin = 1000, # Throw out the first 1000 estimates
                                  seed = 123)) # To get reproducible results&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that LDA can be viewed as a matrix factorization algorithm, and a matrix factorization consists of decomposing a matrix into the product of two or more matrices. Intuitively, we can write: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}\phi_{K,V}\]&lt;/span&gt;&lt;/p&gt;
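&lt;p&gt;A quick way to see this factorization is to inspect the dimensions of the two estimated matrices. The sketch below assumes the &lt;code&gt;SoSA_topics&lt;/code&gt; object fitted above; &lt;code&gt;post&lt;/code&gt; and &lt;code&gt;W_hat&lt;/code&gt; are illustrative names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;post &amp;lt;- posterior(SoSA_topics) # List holding the theta ($topics) and phi ($terms) matrices
dim(post$topics) # D rows (documents) by K columns (topics)
dim(post$terms) # K rows (topics) by V columns (words)
W_hat &amp;lt;- post$topics %*% post$terms # D by V matrix of fitted word proportions
dim(W_hat) # Same dimensions as the original word count matrix
range(rowSums(W_hat)) # Each row should sum to 1, up to rounding error&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the rows of both factors are probability distributions, the product approximates each document’s word proportions and not the raw counts.&lt;/p&gt;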
&lt;section id=&#34;the-reduced-dimension-theta-matrix&#34; class=&#34;level2&#34;&gt;
&lt;h2&gt;The reduced dimension, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix&lt;/h2&gt;
&lt;p&gt;In this example, &lt;span class=&#34;math inline&#34;&gt;\(D=596\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(V=1034\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; contains the essential information needed to understand the variation between observations, concerning the speeches. For instance, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be used to study how Democrats differ from Republicans regarding the relative importance of themes they cover in their speeches. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be seen as a regular spreadsheet of data, as shown below. For an extended exposition of LDA, see &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;this&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta_matrix &amp;lt;- posterior(SoSA_topics)$topics # Extract the theta matrix
theta_matrix &amp;lt;- round(as.data.frame(theta_matrix), digits = 3)
names(theta_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2, sep = &amp;quot;&amp;quot;) # Name the columns
head(theta_matrix, n = 10) # Print out the first 10 observations&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                       Topic.1 Topic.2
## Alabama_2001_D_1.txt    0.274   0.726
## Alabama_2002_D_2.txt    0.377   0.623
## Alabama_2003_R_3.txt    0.767   0.233
## Alabama_2004_R_4.txt    0.613   0.387
## Alabama_2005_R_5.txt    0.484   0.516
## Alabama_2006_R_6.txt    0.513   0.487
## Alabama_2007_R_7.txt    0.424   0.576
## Alabama_2008_R_8.txt    0.481   0.519
## Alabama_2009_R_9.txt    0.516   0.484
## Alabama_2010_R_10.txt   0.583   0.417&lt;/code&gt;&lt;/pre&gt;
&lt;/section&gt;
&lt;section id=&#34;how-do-we-know-which-themes-are-covered&#34; class=&#34;level2&#34;&gt;
&lt;h2&gt;How do we know which themes are covered?&lt;/h2&gt;
&lt;p&gt;Well, here we imposed the number of themes by setting &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. To identify the themes, we use the matrix &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;, which presents the relative importance of each word for each theme (or topic).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;phi_matrix &amp;lt;- posterior(SoSA_topics)$terms # Extract the phi matrix
phi_matrix &amp;lt;- round(phi_matrix, 3) # Round the numbers to 3 decimals
phi_matrix[, 1:20] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    abil  abus academ acceler accept access accomplish accord account
## 1 0.001 0.001  0.000       0  0.001  0.000      0.000      0   0.002
## 2 0.000 0.001  0.001       0  0.000  0.003      0.001      0   0.001
##   achiev acknowledg across action activ actual addit address adequ
## 1  0.001      0.001  0.001  0.001 0.001  0.001 0.003   0.005     0
## 2  0.002      0.000  0.002  0.001 0.001  0.000 0.001   0.001     0
##   administr adopt
## 1     0.003 0.001
## 2     0.000 0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might be more helpful to transpose the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix, so that sorting each topic’s column in decreasing order of word weight reveals the most important words for that topic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;T_phi_matrix &amp;lt;- as.data.frame(t(phi_matrix))
names(T_phi_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2)
T_phi_matrix[1:20, ] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic. 1 Topic. 2
## abil          0.001    0.000
## abus          0.001    0.001
## academ        0.000    0.001
## acceler       0.000    0.000
## accept        0.001    0.000
## access        0.000    0.003
## accomplish    0.000    0.001
## accord        0.000    0.000
## account       0.002    0.001
## achiev        0.001    0.002
## acknowledg    0.001    0.000
## across        0.001    0.002
## action        0.001    0.001
## activ         0.001    0.001
## actual        0.001    0.000
## addit         0.003    0.001
## address       0.005    0.001
## adequ         0.000    0.000
## administr     0.003    0.000
## adopt         0.001    0.000&lt;/code&gt;&lt;/pre&gt;
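&lt;p&gt;The sorting step can be sketched as follows, assuming the &lt;code&gt;T_phi_matrix&lt;/code&gt; data frame built in the previous chunk; &lt;code&gt;word_order&lt;/code&gt; and &lt;code&gt;top_topic1&lt;/code&gt; are illustrative names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;word_order &amp;lt;- order(T_phi_matrix[[1]], decreasing = TRUE) # Rank words by their weight in the first topic
top_topic1 &amp;lt;- T_phi_matrix[word_order[1:10], ] # Keep the 10 heaviest words
top_topic1 # Print the words with the largest weights in the first topic&lt;/code&gt;&lt;/pre&gt;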
&lt;p&gt;The &lt;code&gt;terms()&lt;/code&gt; function of the &lt;code&gt;topicmodels&lt;/code&gt; package returns a convenient version of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix that replaces the word weights with the words themselves, after sorting each row of the &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; matrix in decreasing order of weight.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;terms_matrix &amp;lt;- terms(SoSA_topics, 30) # Extract the first 30 most important words for each topic
terms_matrix[1:15, ] # Print out the first 15 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1   Topic 2   
##  [1,] &amp;quot;budget&amp;quot;  &amp;quot;school&amp;quot;  
##  [2,] &amp;quot;fund&amp;quot;    &amp;quot;work&amp;quot;    
##  [3,] &amp;quot;govern&amp;quot;  &amp;quot;educ&amp;quot;    
##  [4,] &amp;quot;peopl&amp;quot;   &amp;quot;help&amp;quot;    
##  [5,] &amp;quot;million&amp;quot; &amp;quot;children&amp;quot;
##  [6,] &amp;quot;work&amp;quot;    &amp;quot;make&amp;quot;    
##  [7,] &amp;quot;make&amp;quot;    &amp;quot;famili&amp;quot;  
##  [8,] &amp;quot;public&amp;quot;  &amp;quot;nation&amp;quot;  
##  [9,] &amp;quot;propos&amp;quot;  &amp;quot;busi&amp;quot;    
## [10,] &amp;quot;servic&amp;quot;  &amp;quot;creat&amp;quot;   
## [11,] &amp;quot;dollar&amp;quot;  &amp;quot;health&amp;quot;  
## [12,] &amp;quot;know&amp;quot;    &amp;quot;student&amp;quot; 
## [13,] &amp;quot;spend&amp;quot;   &amp;quot;invest&amp;quot;  
## [14,] &amp;quot;increas&amp;quot; &amp;quot;teacher&amp;quot; 
## [15,] &amp;quot;program&amp;quot; &amp;quot;care&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By exploring the most important words for each topic, it seems reasonable to infer that Topic.1 is about “money”, the budget; and Topic.2 is mostly about education.&lt;/p&gt;
&lt;p&gt;In sum, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; provides the essential information needed to understand variations or differences between observations; &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is used to infer the meaning of each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id=&#34;using-theta-for-statistical-analysis&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Using &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; for statistical analysis&lt;/h1&gt;
&lt;p&gt;What use can we make of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;? Quite a lot!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; alone, or combined with other control variables, &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;, can be used for regular statistical analysis, and it has already been used in economic analyses. &lt;span class=&#34;citation&#34; data-cites=&#34;Brown2016&#34;&gt;(Brown, Crowley, and Elliott 2016)&lt;/span&gt; applied LDA to assess whether the thematic content of financial statement disclosures is informative in predicting intentional misreporting. &lt;span class=&#34;citation&#34; data-cites=&#34;Hansen2016&#34;&gt;(Hansen and McMahon 2016)&lt;/span&gt; use LDA in a Factor Augmented Vector Autoregressive modeling framework. I have a working paper exploring the relationship between U.S. governors’ commitments to their economic agendas, as stated in their public statements, and the expansion of business establishments in their states &lt;span class=&#34;citation&#34; data-cites=&#34;Bikienga2017&#34;&gt;(Bikienga 2017)&lt;/span&gt;. For a survey of the use of LDA and other text analytics tools in economics, see &lt;span class=&#34;citation&#34; data-cites=&#34;Gentzkow2017&#34;&gt;(Gentzkow, Kelly, and Taddy 2017)&lt;/span&gt;.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;illustration-of-the-use-of-theta&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Illustration of the use of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;&lt;/h1&gt;
&lt;p&gt;Is there any difference between Democrats and Republicans based on the themes covered in their speeches? To answer this question, we can compute the mean values of the topics by party line. Note that D, R, or I is appended to the rownames of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; shown above. They stand for Democrat, Republican, or Independent.&lt;/p&gt;
&lt;p&gt;Here, I use the row names to construct additional variables (&lt;code&gt;state&lt;/code&gt;, &lt;code&gt;party&lt;/code&gt;, and &lt;code&gt;year&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)
state_vars &amp;lt;- row.names(theta_matrix) %&amp;gt;% 
  str_split(pattern = &amp;quot;_&amp;quot;) %&amp;gt;% as.data.frame() %&amp;gt;% t()
state_vars &amp;lt;- state_vars[, -4]
state_vars &amp;lt;- data.frame(state_vars)
names(state_vars) &amp;lt;- c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;party&amp;quot;)
df &amp;lt;- data.frame(theta_matrix, state_vars)
n_obs &amp;lt;- sample(1:596, size = 10)
sample_obs &amp;lt;- df[n_obs,]
sample_obs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                            Topic.1 Topic.2       state year party
## Florida_2009_R_94.txt        0.381   0.619     Florida 2009     R
## Kansas_2009_D_171.txt        0.422   0.578      Kansas 2009     D
## Maryland_2003_R_204.txt      0.435   0.565    Maryland 2003     R
## Illinois_2010_D_139.txt      0.579   0.421    Illinois 2010     D
## SouthDakota_2007_R_405.txt   0.378   0.622 SouthDakota 2007     R
## Tennessee_2002_R_411.txt     0.399   0.601   Tennessee 2002     R
## Florida_2004_R_89.txt        0.217   0.783     Florida 2004     R
## RhodeIsland_2002_R_534.txt   0.375   0.625 RhodeIsland 2002     R
## Alabama_2003_R_3.txt         0.767   0.233     Alabama 2003     R
## Minnesota_2008_R_241.txt     0.387   0.613   Minnesota 2008     R&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compute the topics’ means by party line.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(tidyr)
df_by_party &amp;lt;- df %&amp;gt;%
  group_by(party) %&amp;gt;%
  summarise(Topic.1 = mean(Topic.1), Topic.2 = mean(Topic.2)) %&amp;gt;%
  gather(Topic, Topic_proportion, Topic.1:Topic.2) %&amp;gt;%
  mutate(Topic_proportion = round(100*Topic_proportion, 0)) %&amp;gt;%
  mutate(pos = c(rep(75, 3), rep(25, 3)))
df_by_party&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 4
##   party Topic   Topic_proportion   pos
##   &amp;lt;fct&amp;gt; &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 D     Topic.1              46.   75.
## 2 I     Topic.1              62.   75.
## 3 R     Topic.1              51.   75.
## 4 D     Topic.2              54.   25.
## 5 I     Topic.2              38.   25.
## 6 R     Topic.2              49.   25.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Democrats seem to talk more about education (Topic.2) than Republicans. On average, about 54% of the content of their speeches is devoted to the education theme, against 49% for Republicans. Conversely, Republicans tend to talk more about budgetary issues than Democrats (51% for Republicans vs. 46% for Democrats).&lt;/p&gt;
&lt;p&gt;Clearly, these differences are not huge, and we should not put too much stock in them. The goal here is to illustrate how one may use the topic distributions, without going into the intricacies of statistical significance.&lt;/p&gt;
&lt;p&gt;The above table can be visualized with the help of a stacked bar plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggthemes)
library(extrafont)
#library(plyr)
#library(scales)
fill &amp;lt;- c(&amp;quot;#add8e6&amp;quot;, &amp;quot;#b87333&amp;quot;)
p_party &amp;lt;- ggplot() +
  geom_bar(aes(y = Topic_proportion, x = party, fill = Topic), 
           data = df_by_party, stat=&amp;quot;identity&amp;quot;) +
  geom_text(data=df_by_party, aes(x = party, y = pos, label = paste0(Topic_proportion,&amp;quot;%&amp;quot;)),
            colour=&amp;quot;black&amp;quot;, family=&amp;quot;Tahoma&amp;quot;, size=4) +
  theme(legend.position=&amp;quot;bottom&amp;quot;, legend.direction=&amp;quot;horizontal&amp;quot;,
        legend.title = element_blank()) +
  labs(x=&amp;quot;Political Party&amp;quot;, y=&amp;quot;Percentage&amp;quot;) +
  ggtitle(&amp;quot;Average Proportion of Topic Covered By Party (%)&amp;quot;) +
  scale_fill_manual(values=fill) +
  theme(axis.line = element_line(size=1, colour = &amp;quot;black&amp;quot;),
        panel.grid.major = element_line(colour = &amp;quot;#d3d3d3&amp;quot;), panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  theme(plot.title = element_text(size = 14, family = &amp;quot;Tahoma&amp;quot;, face = &amp;quot;bold&amp;quot;),
        text=element_text(family=&amp;quot;Tahoma&amp;quot;),
        axis.text.x=element_text(colour=&amp;quot;black&amp;quot;, size = 10),
        axis.text.y=element_text(colour=&amp;quot;black&amp;quot;, size = 10))
p_party&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-11-11-topic-modeling-an-application_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;should-we-trust-the-results&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Should we trust the results?&lt;/h1&gt;
&lt;p&gt;Yes! We should. A mental block I faced when I started exploring topic modeling was trusting the results. If your program is like mine, latent variable models are not covered in your econometrics classes, even though they are widely used in the economics literature. In macroeconomics, they appear as Factor Augmented Vector Autoregressive models. In development economics, they are used to construct indices &lt;span class=&#34;citation&#34; data-cites=&#34;Berenger2007 Tabellini2010&#34;&gt;(Bérenger and Verdier-Chouchane 2007; Tabellini 2010)&lt;/span&gt;. Factor model approaches are also used to construct instruments &lt;span class=&#34;citation&#34; data-cites=&#34;Bai2010&#34;&gt;(Bai and Ng 2010)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;LDA is just another factor model algorithm. It is closely related to principal component analysis (PCA). In a future post, I will present the idea of factor models, and why they are “reliable”.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;conclusion&#34; class=&#34;level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In sum, topic modeling in general, and LDA in particular, is a dimension reduction method. It consists of collapsing a matrix of word counts into a reduced matrix of topic proportions. This illustration provides a sense of its usefulness for statistical analysis.&lt;/p&gt;
&lt;/section&gt;
&lt;section id=&#34;references&#34; class=&#34;level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bai2010&#34;&gt;
&lt;p&gt;Bai, Jushan, and Serena Ng. 2010. “Instrumental Variable Estimation in a Data Rich Environment.” &lt;em&gt;Econometric Theory&lt;/em&gt; 26 (6). Cambridge University Press: 1577–1606.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Berenger2007&#34;&gt;
&lt;p&gt;Bérenger, Valérie, and Audrey Verdier-Chouchane. 2007. “Multidimensional Measures of Well-Being: Standard of Living and Quality of Life Across Countries.” &lt;em&gt;World Development&lt;/em&gt; 35 (7). Elsevier: 1259–76.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Bikienga2017&#34;&gt;
&lt;p&gt;Bikienga, Salfo. 2017. “The Governor as the Entrepreneur in Chief: An Exploratory Analysis.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;Journal of Machine Learning Research&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Brown2016&#34;&gt;
&lt;p&gt;Brown, Nerissa C, Richard M Crowley, and W Brooke Elliott. 2016. “What Are You Saying? Using Topic to Detect Financial Misreporting.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Gentzkow2017&#34;&gt;
&lt;p&gt;Gentzkow, Matthew, Bryan T Kelly, and Matt Taddy. 2017. “Text as Data.” National Bureau of Economic Research.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Hansen2016&#34;&gt;
&lt;p&gt;Hansen, Stephen, and Michael McMahon. 2016. “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” &lt;em&gt;Journal of International Economics&lt;/em&gt; 99. Elsevier: S114–S133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Tabellini2010&#34;&gt;
&lt;p&gt;Tabellini, Guido. 2010. “Culture and Institutions: Economic Development in the Regions of Europe.” &lt;em&gt;Journal of the European Economic Association&lt;/em&gt; 8 (4). Oxford University Press: 677–716.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/section&gt;
</description>
    </item>
    
    <item>
      <title>Topic Modeling: An Application</title>
      <link>/post/topic-modeling-an-application/</link>
      <pubDate>Sat, 11 Nov 2017 00:00:00 +0000</pubDate>
      
      <guid>/post/topic-modeling-an-application/</guid>
      <description>&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;My work involves the use and the development of topic modeling algorithms. A surprising challenge I have had is communicating the output of topic modeling algorithms to people not familiar with text analytics. Here is my ten-cent explanation of the LDA output for my econ friends.&lt;/p&gt;
&lt;p&gt;The use of text data for &lt;a href=&#34;http://review.chicagobooth.edu/magazine/spring-2015/why-words-are-the-new-numbers&#34; target=&#34;_blank&#34;&gt;economic analysis&lt;/a&gt; is gaining traction. One popular analytical tool is Latent Dirichlet Allocation (LDA), a widely used topic modeling method &lt;span class=&#34;citation&#34;&gt;(Blei, Ng, and Jordan 2003)&lt;/span&gt;. Succinctly put, topic modeling consists of collapsing a matrix (i.e., a spreadsheet) of word counts into a reduced matrix of topic proportions within documents. For instance, assume we have a collection of 500 documents with a combined vocabulary of 2000 unique words; this collection of documents (called a corpus) can be represented as a dataset of 500 observations and 2000 variables (each word being a variable). Each cell in the matrix represents the count of a word in a document. The matrix is just a regular spreadsheet of data. Clearly, it is almost impossible to draw any insight from that many variables. LDA allows us to collapse the high-dimensional dataset into a lower dimension, say a dimension of 10. With 10 variables, there is hope that some insight can be drawn from the data. What follows is a demonstration of LDA.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example Data&lt;/h1&gt;
&lt;p&gt;Let’s consider a dataset of U.S. governors’ State of the State Addresses (SoSA). In most states, the governor gives a speech, generally in January, in which he/she lays out his/her priorities for the next fiscal year. Part of the goal of the speech is to explain (or justify) the proposed budget, and hopefully convince the state stakeholders to support the proposed budget. A budget proposal usually involves a reallocation of the state resources, which implies cuts and increases in different lines of the state budget. I collected 596 speeches from governors of the 50 states, spanning from 2001 to 2013.&lt;/p&gt;
&lt;p&gt;It is customary in text analytics to delete words that we believe are not “discriminative”. For instance, function words such as “the”, “and”, “she”, etc. will not distinguish a Democrat from a Republican. We call this step pre-processing the data, that is, cleaning the data by removing elements of the texts that we believe are not useful for our analysis.&lt;/p&gt;
&lt;p&gt;After pre-processing the data, I am left with a dataset of 596 observations and 1034 words (or variables). You can take a look at the pre-processed data &lt;a href=&#34;http://rpubs.com/sbikienga/334137&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;, or you can download it &lt;a href=&#34;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&#34; target=&#34;_blank&#34;&gt;here&lt;/a&gt;. Stemming, that is, stripping words down to their roots, is often done to avoid counting related words separately. For example, “education”, “educational”, and “educate” are all stemmed to “educ”.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;example-application-of-lda&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Example application of LDA&lt;/h1&gt;
&lt;p&gt;The goal when using LDA is primarily to reduce the dimension of a counts dataset. The hope is that the reduced dimension preserves the essential information contained in the original dataset. Interestingly, the reduced dimension is often more appropriate for statistical analysis, as it “solves” the overfitting problem associated with high dimensional data. Generally, the overfitting problem arises in situations where &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;, the number of observations, is not big enough to provide reliable estimates of the &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt; variables’ parameters.&lt;/p&gt;
&lt;p&gt;There are several packages in R to implement the LDA model (&lt;code&gt;lda&lt;/code&gt;, &lt;code&gt;mallet&lt;/code&gt;, and &lt;code&gt;topicmodels&lt;/code&gt;). Here I will use the &lt;code&gt;topicmodels&lt;/code&gt; package as an example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# install.packages(&amp;quot;topicmodels&amp;quot;) # You should run this code once if you don&amp;#39;t have topicmodels installed
library(topicmodels) # Load the topicmodels package
url &amp;lt;- url(&amp;quot;https://github.com/Salfo/States-Addresses/raw/master/data/SoSA_data_df.RData&amp;quot;)
load(url) # Load the data from the url provided
SoSA_topics &amp;lt;- LDA(SoSA_data_df, # The matrix of words counts
                   k = 2, # The number of topics to construct
                   method = &amp;quot;Gibbs&amp;quot;, # Estimation method
                   control = list(iter = 3000, # Number of iterations
                                  burnin = 1000, # Throw out the first 1000 estimates
                                  seed = 123)) # To get reproducible results&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that LDA can be viewed as a matrix factorization algorithm, and a matrix factorization consists of decomposing a matrix into the product of two or more matrices. Intuitively, we can write: &lt;span class=&#34;math display&#34;&gt;\[W_{D,V} \simeq \theta_{D,K}\phi_{K,V}\]&lt;/span&gt;&lt;/p&gt;
&lt;div id=&#34;the-reduced-dimension-theta-matrix&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The reduced dimension, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix&lt;/h2&gt;
&lt;p&gt;In this example, &lt;span class=&#34;math inline&#34;&gt;\(D=596\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(V=1034\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; contains the essential information needed to understand the variation between observations, concerning the speeches. For instance, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be used to study how Democrats differ from Republicans regarding the relative importance of themes they cover in their speeches. &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; can be seen as a regular spreadsheet of data, as shown below. For an extended exposition of LDA, see &lt;a href=&#34;http://www.salfobikienga.rbind.io/post/introduction-to-lda/&#34; target=&#34;_blank&#34;&gt;this&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theta_matrix &amp;lt;- posterior(SoSA_topics)$topics # Extract the theta matrix
theta_matrix &amp;lt;- round(as.data.frame(theta_matrix), digits = 3)
names(theta_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2, sep = &amp;quot;&amp;quot;) # Name the columns
head(theta_matrix, n = 10) # Print out the first 10 observations&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                       Topic.1 Topic.2
## Alabama_2001_D_1.txt    0.274   0.726
## Alabama_2002_D_2.txt    0.377   0.623
## Alabama_2003_R_3.txt    0.767   0.233
## Alabama_2004_R_4.txt    0.613   0.387
## Alabama_2005_R_5.txt    0.484   0.516
## Alabama_2006_R_6.txt    0.513   0.487
## Alabama_2007_R_7.txt    0.424   0.576
## Alabama_2008_R_8.txt    0.481   0.519
## Alabama_2009_R_9.txt    0.516   0.484
## Alabama_2010_R_10.txt   0.583   0.417&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;how-do-we-know-which-themes-are-covered&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How do we know which themes are covered?&lt;/h2&gt;
&lt;p&gt;Well, here we imposed the number of themes by setting &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;. To identify the themes, we use the matrix &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;, which presents the relative importance of each word for each theme (or topic).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;phi_matrix &amp;lt;- posterior(SoSA_topics)$terms # Extract the phi matrix
phi_matrix &amp;lt;- round(phi_matrix, 3) # Round the numbers to 3 decimals
phi_matrix[, 1:20] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    abil  abus academ acceler accept access accomplish accord account
## 1 0.001 0.001  0.000       0  0.001  0.000      0.000      0   0.002
## 2 0.000 0.001  0.001       0  0.000  0.003      0.001      0   0.001
##   achiev acknowledg across action activ actual addit address adequ
## 1  0.001      0.001  0.001  0.001 0.001  0.001 0.003   0.005     0
## 2  0.002      0.000  0.002  0.001 0.001  0.000 0.001   0.001     0
##   administr adopt
## 1     0.003 0.001
## 2     0.000 0.000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might be more helpful to transpose &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; so that the words are in rows and the topics in columns. Sorting a topic’s column by decreasing weight then reveals the most important words for that topic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;T_phi_matrix &amp;lt;- as.data.frame(t(phi_matrix)) # Words in rows, topics in columns
names(T_phi_matrix) &amp;lt;- paste(&amp;quot;Topic.&amp;quot;, 1:2, sep = &amp;quot;&amp;quot;) # Same column names as theta_matrix
T_phi_matrix[1:20, ] # Print out the first 20 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##            Topic.1 Topic.2
## abil         0.001   0.000
## abus         0.001   0.001
## academ       0.000   0.001
## acceler      0.000   0.000
## accept       0.001   0.000
## access       0.000   0.003
## accomplish   0.000   0.001
## accord       0.000   0.000
## account      0.002   0.001
## achiev       0.001   0.002
## acknowledg   0.001   0.000
## across       0.001   0.002
## action       0.001   0.001
## activ        0.001   0.001
## actual       0.001   0.000
## addit        0.003   0.001
## address      0.005   0.001
## adequ        0.000   0.000
## administr    0.003   0.000
## adopt        0.001   0.000&lt;/code&gt;&lt;/pre&gt;
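&lt;p&gt;The sorting step itself can be sketched as follows. This is a minimal illustration rather than part of the original analysis: &lt;code&gt;phi_demo&lt;/code&gt; is a hypothetical data frame holding a handful of the stemmed words and weights printed above, and base R’s &lt;code&gt;order()&lt;/code&gt; ranks the words by decreasing weight within one topic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical stand-in for the transposed phi: a few words and weights from above
phi_demo &amp;lt;- data.frame(
  Topic.1 = c(0.001, 0.000, 0.002, 0.005, 0.003),
  Topic.2 = c(0.000, 0.003, 0.001, 0.001, 0.000),
  row.names = c(&amp;quot;abil&amp;quot;, &amp;quot;access&amp;quot;, &amp;quot;account&amp;quot;, &amp;quot;address&amp;quot;, &amp;quot;administr&amp;quot;)
)
# Reorder the rows so that the heaviest Topic.1 words come first
sorted_topic1 &amp;lt;- phi_demo[order(phi_demo$Topic.1, decreasing = TRUE), ]
# The row names now list the words from most to least important for Topic.1
top_words_topic1 &amp;lt;- row.names(sorted_topic1)[1:3]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Repeating the call with &lt;code&gt;Topic.2&lt;/code&gt; in place of &lt;code&gt;Topic.1&lt;/code&gt; ranks the words for the second topic.&lt;/p&gt;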
&lt;p&gt;The &lt;code&gt;terms()&lt;/code&gt; function of the &lt;code&gt;topicmodels&lt;/code&gt; package does this sorting for us: for each topic, it returns the words themselves, rather than their weights, ranked by decreasing weight in &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;terms_matrix &amp;lt;- terms(SoSA_topics, 30) # Extract the first 30 most important words for each topic
terms_matrix[1:15, ] # Print out the first 15 words&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1   Topic 2   
##  [1,] &amp;quot;budget&amp;quot;  &amp;quot;school&amp;quot;  
##  [2,] &amp;quot;fund&amp;quot;    &amp;quot;work&amp;quot;    
##  [3,] &amp;quot;govern&amp;quot;  &amp;quot;educ&amp;quot;    
##  [4,] &amp;quot;peopl&amp;quot;   &amp;quot;help&amp;quot;    
##  [5,] &amp;quot;million&amp;quot; &amp;quot;children&amp;quot;
##  [6,] &amp;quot;work&amp;quot;    &amp;quot;make&amp;quot;    
##  [7,] &amp;quot;make&amp;quot;    &amp;quot;famili&amp;quot;  
##  [8,] &amp;quot;public&amp;quot;  &amp;quot;nation&amp;quot;  
##  [9,] &amp;quot;propos&amp;quot;  &amp;quot;busi&amp;quot;    
## [10,] &amp;quot;servic&amp;quot;  &amp;quot;creat&amp;quot;   
## [11,] &amp;quot;dollar&amp;quot;  &amp;quot;health&amp;quot;  
## [12,] &amp;quot;know&amp;quot;    &amp;quot;student&amp;quot; 
## [13,] &amp;quot;spend&amp;quot;   &amp;quot;invest&amp;quot;  
## [14,] &amp;quot;increas&amp;quot; &amp;quot;teacher&amp;quot; 
## [15,] &amp;quot;program&amp;quot; &amp;quot;care&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Judging by the most important words for each topic, it seems reasonable to infer that Topic.1 is about money (the budget) and that Topic.2 is mostly about education.&lt;/p&gt;
&lt;p&gt;In sum, &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; provides the essential information needed to understand variations or differences between observations; &lt;span class=&#34;math inline&#34;&gt;\(\phi\)&lt;/span&gt; is used to infer the meaning of each of the &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;using-theta-for-statistical-analysis&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; for statistical analysis&lt;/h1&gt;
&lt;p&gt;What use can we make of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;? Quite a lot!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; alone, or combined with other control variables, &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt;, can be used for regular statistical analysis, and it has already been used for economic analyses. &lt;span class=&#34;citation&#34;&gt;Brown, Crowley, and Elliott (2016)&lt;/span&gt; applied LDA to assess whether the thematic content of financial statement disclosures is informative in predicting intentional misreporting. &lt;span class=&#34;citation&#34;&gt;Hansen and McMahon (2016)&lt;/span&gt; used LDA in a Factor Augmented Vector Autoregressive modeling framework. I have a working paper exploring the relationship between US governors’ commitments to their economic agenda, as stated in their public statements, and the expansion of business establishments in their states &lt;span class=&#34;citation&#34;&gt;(Bikienga 2017)&lt;/span&gt;. For a survey of the use of LDA and other text analytics tools in economics, see &lt;span class=&#34;citation&#34;&gt;Gentzkow, Kelly, and Taddy (2017)&lt;/span&gt;.&lt;/p&gt;
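&lt;p&gt;As a rough sketch of what “regular statistical analysis” can look like, the topic proportions can enter a regression like any other covariate. Everything in the block below is simulated and hypothetical (the outcome &lt;code&gt;y&lt;/code&gt;, the control &lt;code&gt;x&lt;/code&gt;, and the coefficients are made up for illustration and are not taken from any of the papers cited above). The one substantive point is that the columns of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; sum to 1 in every row, so one topic has to be left out to avoid perfect collinearity with the intercept.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1) # Make the simulated data reproducible
n &amp;lt;- 100
# Simulated topic proportions: each row sums to 1, as in theta
Topic.1 &amp;lt;- runif(n)
Topic.2 &amp;lt;- 1 - Topic.1
# Hypothetical control variable and outcome, generated only for illustration
x &amp;lt;- rnorm(n)
y &amp;lt;- 1 + 2 * Topic.1 + 0.5 * x + rnorm(n)
sim_df &amp;lt;- data.frame(y, Topic.1, Topic.2, x)
# Topic.2 is left out because it is fully determined by Topic.1
fit &amp;lt;- lm(y ~ Topic.1 + x, data = sim_df)
coef(fit)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The coefficient on &lt;code&gt;Topic.1&lt;/code&gt; is then read relative to the omitted topic.&lt;/p&gt;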
&lt;/div&gt;
&lt;div id=&#34;illustration-of-the-use-of-theta&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Illustration of the use of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;&lt;/h1&gt;
&lt;p&gt;Is there any difference between Democrats and Republicans based on the themes covered in their speeches? To answer this question, we can compute the mean values of the topics by party line. Note that each row name of the &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; matrix shown above contains a D, R, or I, which stands for Democrat, Republican, or Independent.&lt;/p&gt;
&lt;p&gt;Here, I use the row names to construct additional variables (&lt;code&gt;state&lt;/code&gt;, &lt;code&gt;party&lt;/code&gt;, and &lt;code&gt;year&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)
# Split each row name (e.g. Alabama_2001_D_1.txt) on the underscores
state_vars &amp;lt;- row.names(theta_matrix) %&amp;gt;%
  str_split(pattern = &amp;quot;_&amp;quot;) %&amp;gt;% as.data.frame() %&amp;gt;% t()
state_vars &amp;lt;- state_vars[, -4] # Drop the fourth piece (the file index)
state_vars &amp;lt;- data.frame(state_vars)
names(state_vars) &amp;lt;- c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;party&amp;quot;)
df &amp;lt;- data.frame(theta_matrix, state_vars) # Topic proportions plus the new variables
n_obs &amp;lt;- sample(1:nrow(df), size = 10) # Draw 10 of the 596 rows at random
sample_obs &amp;lt;- df[n_obs,]
sample_obs&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                             Topic.1 Topic.2        state year party
## Idaho_2008_R_126.txt          0.648   0.352        Idaho 2008     R
## NewJersey_2009_D_307.txt      0.477   0.523    NewJersey 2009     D
## NewHampshire_2007_D_295.txt   0.277   0.723 NewHampshire 2007     D
## Alabama_2005_R_5.txt          0.484   0.516      Alabama 2005     R
## Tennessee_2013_R_588.txt      0.669   0.331    Tennessee 2013     R
## Wyoming_2010_D_499.txt        0.795   0.205      Wyoming 2010     D
## Washington_2002_D_460.txt     0.446   0.554   Washington 2002     D
## Maine_2005_D_195.txt          0.344   0.656        Maine 2005     D
## Virginia_2011_R_458.txt       0.570   0.430     Virginia 2011     R
## California_2011_D_52.txt      0.679   0.321   California 2011     D&lt;/code&gt;&lt;/pre&gt;
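&lt;p&gt;For readers who do not use &lt;code&gt;stringr&lt;/code&gt;, the same parsing can be sketched in base R. The file names below are copied from the output above; &lt;code&gt;file_names&lt;/code&gt;, &lt;code&gt;parts&lt;/code&gt;, and &lt;code&gt;vars_demo&lt;/code&gt; are hypothetical names used only in this illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A few row names copied from the output above
file_names &amp;lt;- c(&amp;quot;Idaho_2008_R_126.txt&amp;quot;, &amp;quot;NewJersey_2009_D_307.txt&amp;quot;,
                &amp;quot;Maine_2005_D_195.txt&amp;quot;)
# Split each name on the underscore: state, year, party, and a file index
parts &amp;lt;- do.call(rbind, strsplit(file_names, &amp;quot;_&amp;quot;, fixed = TRUE))
# Keep the first three pieces and label them; the fourth piece is dropped
vars_demo &amp;lt;- data.frame(parts[, 1:3], stringsAsFactors = FALSE)
names(vars_demo) &amp;lt;- c(&amp;quot;state&amp;quot;, &amp;quot;year&amp;quot;, &amp;quot;party&amp;quot;)
vars_demo$year &amp;lt;- as.integer(vars_demo$year) # Store the year as a number&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Storing &lt;code&gt;year&lt;/code&gt; as an integer is a choice made here for the sketch; it makes it easier to use the year later as a numeric variable.&lt;/p&gt;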
&lt;p&gt;Compute the topics’ means by party line.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(tidyr)
df_by_party &amp;lt;- df %&amp;gt;%
  group_by(party) %&amp;gt;%
  summarise(Topic.1 = mean(Topic.1), Topic.2 = mean(Topic.2)) %&amp;gt;% # Mean share by party
  gather(Topic, Topic_proportion, Topic.1:Topic.2) %&amp;gt;% # Reshape to long format
  mutate(Topic_proportion = round(100*Topic_proportion, 0)) %&amp;gt;% # Express as percentages
  mutate(pos = c(rep(75, 3), rep(25, 3))) # Heights of the text labels in the bar plot
df_by_party&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 4
##   party Topic   Topic_proportion   pos
##   &amp;lt;fct&amp;gt; &amp;lt;chr&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 D     Topic.1               46    75
## 2 I     Topic.1               62    75
## 3 R     Topic.1               51    75
## 4 D     Topic.2               54    25
## 5 I     Topic.2               38    25
## 6 R     Topic.2               49    25&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Democrats seem to talk more about education (Topic.2) than Republicans do. On average, about 54% of the content of their speeches relates to the education theme, against 49% for Republicans. Conversely, Republicans tend to talk more about budgetary issues than Democrats (51% for Republicans vs. 46% for Democrats).&lt;/p&gt;
&lt;p&gt;Clearly, these differences are not large, and we should not put too much stock in them. The goal here is to illustrate how one may use the topic distributions, without going into the intricacies of statistical significance.&lt;/p&gt;
&lt;p&gt;The above table can be visualized with the help of a stacked bar plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggthemes)
library(extrafont)
fill &amp;lt;- c(&amp;quot;#add8e6&amp;quot;, &amp;quot;#b87333&amp;quot;)
p_party &amp;lt;- ggplot() +
  geom_bar(aes(y = Topic_proportion, x = party, fill = Topic), 
           data = df_by_party, stat=&amp;quot;identity&amp;quot;) +
  geom_text(data=df_by_party, aes(x = party, y = pos, label = paste0(Topic_proportion,&amp;quot;%&amp;quot;)),
            colour=&amp;quot;black&amp;quot;, family=&amp;quot;Tahoma&amp;quot;, size=4) +
  theme(legend.position=&amp;quot;bottom&amp;quot;, legend.direction=&amp;quot;horizontal&amp;quot;,
        legend.title = element_blank()) +
  labs(x=&amp;quot;Political Party&amp;quot;, y=&amp;quot;Percentage&amp;quot;) +
  ggtitle(&amp;quot;Average Proportion of Topic Covered By Party (%)&amp;quot;) +
  scale_fill_manual(values=fill) +
  theme(axis.line = element_line(size=1, colour = &amp;quot;black&amp;quot;),
        panel.grid.major = element_line(colour = &amp;quot;#d3d3d3&amp;quot;), panel.grid.minor = element_blank(),
        panel.border = element_blank(), panel.background = element_blank()) +
  theme(plot.title = element_text(size = 14, family = &amp;quot;Tahoma&amp;quot;, face = &amp;quot;bold&amp;quot;),
        text=element_text(family=&amp;quot;Tahoma&amp;quot;),
        axis.text.x=element_text(colour=&amp;quot;black&amp;quot;, size = 10),
        axis.text.y=element_text(colour=&amp;quot;black&amp;quot;, size = 10))
p_party&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/2017-11-11-topic-modeling-an-application_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;should-we-trust-the-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Should we trust the results?&lt;/h1&gt;
&lt;p&gt;Yes, we should. A mental block I faced when I started exploring topic modeling was trusting the results. If your program is like mine, latent variable models are not covered in your econometrics classes, even though they are widely used in the economics literature. In macroeconomics, they appear as Factor Augmented Vector Autoregressive models. In development economics, they are used to construct indices &lt;span class=&#34;citation&#34;&gt;(Bérenger and Verdier-Chouchane 2007; Tabellini 2010)&lt;/span&gt;. Factor model approaches are also used to construct instruments &lt;span class=&#34;citation&#34;&gt;(Bai and Ng 2010)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;LDA is just another factor model algorithm. It is closely related to principal component analysis (PCA). In the future, I will present the idea of factor models and explain why they are “reliable”.&lt;/p&gt;
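&lt;p&gt;To make the analogy with PCA a little more concrete, here is a minimal sketch on a made-up word-count matrix (the counts and the names &lt;code&gt;counts&lt;/code&gt;, &lt;code&gt;pca&lt;/code&gt;, and &lt;code&gt;reduced&lt;/code&gt; are hypothetical and are not drawn from the speeches). Base R’s &lt;code&gt;prcomp()&lt;/code&gt; collapses the four word columns into component scores, and keeping two of them plays a role similar to setting &lt;span class=&#34;math inline&#34;&gt;\(K=2\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical word counts: 4 documents (rows) by 4 stemmed words (columns)
counts &amp;lt;- matrix(
  c(5, 4, 0, 1,
    4, 5, 1, 0,
    0, 1, 5, 4,
    1, 0, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0(&amp;quot;doc&amp;quot;, 1:4),
                  c(&amp;quot;budget&amp;quot;, &amp;quot;fund&amp;quot;, &amp;quot;school&amp;quot;, &amp;quot;educ&amp;quot;))
)
# PCA on the centered counts; the component scores are stored in pca$x
pca &amp;lt;- prcomp(counts, center = TRUE, scale. = FALSE)
# Keep the first 2 components, by analogy with K = 2 topics
reduced &amp;lt;- pca$x[, 1:2]
dim(reduced)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One difference worth keeping in mind: PCA scores can be negative and do not sum to 1 across components, whereas the rows of &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; are proportions.&lt;/p&gt;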
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In sum, topic modeling in general, and LDA in particular, is a dimension reduction method. It consists of collapsing a matrix of word counts into a reduced matrix of topic distributions. This illustration provides a sense of its usefulness for statistical analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bai2010&#34;&gt;
&lt;p&gt;Bai, Jushan, and Serena Ng. 2010. “Instrumental Variable Estimation in a Data Rich Environment.” &lt;em&gt;Econometric Theory&lt;/em&gt; 26 (6). Cambridge University Press: 1577–1606.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Berenger2007&#34;&gt;
&lt;p&gt;Bérenger, Valérie, and Audrey Verdier-Chouchane. 2007. “Multidimensional Measures of Well-Being: Standard of Living and Quality of Life Across Countries.” &lt;em&gt;World Development&lt;/em&gt; 35 (7). Elsevier: 1259–76.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Bikienga2017&#34;&gt;
&lt;p&gt;Bikienga, Salfo. 2017. “The Governor as the Entrepreneur in Chief: An Exploratory Analysis.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Blei2003&#34;&gt;
&lt;p&gt;Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” &lt;em&gt;Journal of Machine Learning Research&lt;/em&gt; 3 (March). JMLR.org: 993–1022. &lt;a href=&#34;http://dl.acm.org/citation.cfm?id=944919.944937&#34; class=&#34;uri&#34;&gt;http://dl.acm.org/citation.cfm?id=944919.944937&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Brown2016&#34;&gt;
&lt;p&gt;Brown, Nerissa C, Richard M Crowley, and W Brooke Elliott. 2016. “What Are You Saying? Using Topic to Detect Financial Misreporting.”&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Gentzkow2017&#34;&gt;
&lt;p&gt;Gentzkow, Matthew, Bryan T Kelly, and Matt Taddy. 2017. “Text as Data.” National Bureau of Economic Research.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Hansen2016&#34;&gt;
&lt;p&gt;Hansen, Stephen, and Michael McMahon. 2016. “Shocking Language: Understanding the Macroeconomic Effects of Central Bank Communication.” &lt;em&gt;Journal of International Economics&lt;/em&gt; 99. Elsevier: S114–S133.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Tabellini2010&#34;&gt;
&lt;p&gt;Tabellini, Guido. 2010. “Culture and Institutions: Economic Development in the Regions of Europe.” &lt;em&gt;Journal of the European Economic Association&lt;/em&gt; 8 (4). Oxford University Press: 677–716.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
