Understanding Topic Models from Multivariate OLS Perspective: a Gentle Technical Survey

Salfo Bikienga

Abstract

Topic Modeling (TM) is a dimension reduction technique for text data, akin to factor analysis (FA) or principal component analysis (PCA), and is widely used for text analysis tasks such as classification and clustering. Modern TM algorithms such as Latent Dirichlet Allocation (LDA) are probabilistic and complex, which impedes an intuitive understanding of them. Relating them to Non-Negative Matrix Factorization (NMF) and PCA mitigates this impediment. Indeed, besides being analogous to NMF, LDA also emerges from PCA, and both NMF and PCA are intuitively easy to understand. Presenting LDA as emerging from NMF and/or PCA therefore provides an intuitive grounding for modern TM algorithms.

Type

Work in progress

Date

December, 2018
