Dynamic topic model

Dynamic topic models are generative models that can be used to analyze the evolution of (unobserved) topics of a collection of documents over time. This family of models was proposed by David Blei and John Lafferty and is an extension to Latent Dirichlet Allocation (LDA) that can handle sequential documents.^[1]

In LDA, the model is oblivious both to the order in which words appear in a document and to the order in which documents appear in the corpus. In a dynamic topic model, words are still assumed to be exchangeable, but the order of the documents plays a fundamental role. More precisely, the documents are grouped by time slice (e.g., by year) and it is assumed that the documents of each slice come from a set of topics that evolved from the set of topics of the previous slice.

Topics

Similarly to LDA and pLSA, in a dynamic topic model, each document is viewed as a mixture of unobserved topics. Furthermore, each topic defines a multinomial distribution over a set of terms. Thus, for each word of each document, a topic is drawn from the mixture and a term is subsequently drawn from the multinomial distribution corresponding to that topic.

The topics, however, evolve over time. For instance, the two most likely terms of a topic at time $t$ could be "network" and "Zipf" (in descending order) while the most likely ones at time $t+1$ could be "Zipf" and "percolation" (in descending order).

Model

Define

$\alpha_t$ as the mean of the per-document topic distributions at time $t$,
$\beta_{t,k}$ as the word distribution of topic $k$ at time $t$,
$\eta_{t,d}$ as the topic distribution for document $d$ at time $t$,
$Z_{t,d,n}$ as the topic for the $n$th word in document $d$ at time $t$, and
$W_{t,d,n}$ as the specific word.

In this model, the multinomial distributions $\alpha_{t+1}$ and $\beta_{t+1,k}$ are generated from $\alpha_t$ and $\beta_{t,k}$, respectively. Even though multinomial distributions are usually written in terms of the mean parameters, representing them in terms of the natural parameters is more convenient in the context of dynamic topic models.

The mean parameterization has some disadvantages because its parameters are constrained to be non-negative and to sum to one.^[2] When defining the evolution of these distributions, one would need to ensure that these constraints remain satisfied. Since both distributions are in the exponential family, one solution to this problem is to represent them in terms of the natural parameters, which can assume any real value and can be changed individually.
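The difference between the two parameterizations can be illustrated with a small sketch. The helper names `softmax` and `perturb`, the example probabilities, and the noise scale of 0.5 are assumptions chosen for illustration and do not come from the original paper.

```python
import math
import random

def softmax(x):
    """Map natural parameters (any real values) to a probability vector."""
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def perturb(x, scale, rng):
    """Add independent zero-mean Gaussian noise to every component."""
    return [v + rng.gauss(0.0, scale) for v in x]

rng = random.Random(0)

# Mean parameterization: after adding noise, entries can become negative
# and the vector will in general no longer sum to one.
mean_params = [0.7, 0.2, 0.1]
noisy_mean = perturb(mean_params, 0.5, rng)

# Natural parameterization: log-probabilities are one valid choice of
# natural parameters. Noise can be added freely, and the softmax maps
# the result back to a valid distribution.
natural_params = [math.log(p) for p in mean_params]
noisy_natural = perturb(natural_params, 0.5, rng)
valid = softmax(noisy_natural)

print("perturbed mean parameters:", noisy_mean, "sum =", sum(noisy_mean))
print("perturbed natural parameters mapped back:", valid, "sum =", sum(valid))
```

The second vector is always a proper distribution, whereas the first would need to be clipped and renormalized by hand.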

Using the natural parameterization, the dynamics of the topic model are given by

\beta_{t,k}|\beta_{t-1,k} \sim N(\beta_{t-1,k},\sigma^2 I)

and

\alpha_{t}|\alpha_{t-1} \sim N(\alpha_{t-1},\delta^2 I)
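One consequence of these dynamics, which follows from summing independent Gaussian increments rather than from anything specific to topic models, is that the uncertainty about a topic given its initial state grows linearly with the number of elapsed time slices. Writing $\epsilon_{s,k}$ for the (assumed name of the) increment at slice $s$:

```latex
\beta_{t,k} = \beta_{0,k} + \sum_{s=1}^{t} \epsilon_{s,k},
\qquad \epsilon_{s,k} \sim N(0, \sigma^2 I) \ \text{independently}
\quad\Longrightarrow\quad
\beta_{t,k} \mid \beta_{0,k} \sim N(\beta_{0,k},\; t\,\sigma^2 I)
```

A small $\sigma^2$ therefore corresponds to topics that drift slowly from one slice to the next, and the same argument applies to $\alpha_t$ with $\delta^2$.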

The generative process at time slice $t$ is therefore:

Draw topics $\beta_{t,k}|\beta_{t-1,k} \sim N(\beta_{t-1,k},\sigma^2 I) \forall k$
Draw mixture model $\alpha_{t}|\alpha_{t-1} \sim N(\alpha_{t-1},\delta^2 I)$
For each document:
1. Draw $\eta_{t,d} \sim N(\alpha_t,a^2 I)$
2. For each word:
  1. Draw topic $Z_{t,d,n} \sim \textrm{Mult}(\pi(\eta_{t,d}))$
  2. Draw word $W_{t,d,n} \sim \textrm{Mult}(\pi(\beta_{t,Z_{t,d,n}}))$

where $\pi(x)$ is the mapping from the natural parameterization $x$ to the mean parameterization, namely

\pi(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
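The generative process can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than a reference implementation: the function names, the fixed number of words per document, the all-zero initial parameters, and the numeric values of the standard deviations `sigma`, `delta` and `a` are all assumptions made for the example.

```python
import math
import random

def softmax(x):
    """The mapping pi: natural parameters -> mean parameters."""
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_step(mean, std, rng):
    """Draw one sample from N(mean, std^2 I), component by component."""
    return [m + rng.gauss(0.0, std) for m in mean]

def generate_slice(beta_prev, alpha_prev, n_docs, n_words, sigma, delta, a, rng):
    """Generate one time slice from the previous slice's parameters.

    beta_prev is a list of K topic vectors of length V, and alpha_prev is a
    vector of length K. Returns (beta_t, alpha_t, docs), where docs is a
    list of documents and each document is a list of term indices.
    """
    beta_t = [gaussian_step(b, sigma, rng) for b in beta_prev]  # evolve topics
    alpha_t = gaussian_step(alpha_prev, delta, rng)             # evolve mixture
    n_topics, vocab_size = len(beta_t), len(beta_t[0])
    docs = []
    for _ in range(n_docs):
        eta = gaussian_step(alpha_t, a, rng)   # document-level topic parameters
        topic_probs = softmax(eta)
        words = []
        for _ in range(n_words):
            z = rng.choices(range(n_topics), weights=topic_probs)[0]
            w = rng.choices(range(vocab_size), weights=softmax(beta_t[z]))[0]
            words.append(w)
        docs.append(words)
    return beta_t, alpha_t, docs

rng = random.Random(42)
K, V = 2, 5                          # hypothetical numbers of topics and terms
beta = [[0.0] * V for _ in range(K)]
alpha = [0.0] * K
corpus = []
for t in range(3):                   # three consecutive time slices
    beta, alpha, docs = generate_slice(
        beta, alpha, n_docs=4, n_words=10, sigma=0.1, delta=0.1, a=0.5, rng=rng
    )
    corpus.append(docs)
```

Each call feeds the parameters of the previous slice into the next one, which is what ties the slices together; within a slice, documents are generated independently given $\alpha_t$ and the topics.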

Inference

In the dynamic topic model, only $W_{t,d,n}$ is observable. Learning the other parameters constitutes an inference problem. Blei and Lafferty argue that applying Gibbs sampling to do inference in this model is more difficult than in static models, due to the nonconjugacy of the Gaussian and multinomial distributions. They propose the use of variational methods instead, in particular variational Kalman filtering and variational wavelet regression.
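As a rough illustration of the machinery that Kalman-filter-based approaches build on, the following sketch implements the standard forward pass of a scalar Kalman filter for a random-walk state with Gaussian observation noise. It is not the variational algorithm of the original paper: the function name `kalman_forward`, the pseudo-observations `y`, and the variances `obs_var` and `sigma2` are hypothetical stand-ins, and the full method additionally requires a backward pass and optimization of variational parameters.

```python
def kalman_forward(observations, obs_var, sigma2, m0, v0):
    """Forward Kalman filter for a scalar random walk.

    State model:        x_t = x_{t-1} + N(0, sigma2)
    Observation model:  y_t = x_t + N(0, obs_var)
    Returns the filtered means and variances of x_t given y_1..y_t.
    """
    means, variances = [], []
    m, v = m0, v0
    for y in observations:
        p = v + sigma2            # predicted variance after one random-walk step
        gain = p / (p + obs_var)  # Kalman gain
        m = m + gain * (y - m)    # correct the mean towards the observation
        v = (1.0 - gain) * p      # posterior variance shrinks after observing
        means.append(m)
        variances.append(v)
    return means, variances

# Hypothetical noisy readings of one slowly drifting topic component.
y = [0.10, 0.25, 0.20, 0.40, 0.55]
means, variances = kalman_forward(y, obs_var=0.05, sigma2=0.01, m0=0.0, v0=1.0)
for t, (m, v) in enumerate(zip(means, variances), start=1):
    print(f"t={t}: mean={m:.3f}, variance={v:.3f}")
```

The filter exploits the fact that both the random-walk transition and the observation model are Gaussian, which is the conjugate structure that the multinomial likelihood of the topic model lacks.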

Application

In the original paper, a dynamic topic model is applied to a corpus of Science articles published between 1881 and 1999 to show that the method can be used to analyze trends in word usage within topics.^[1] The authors also show that a model trained on past documents fits the documents of the following year better than LDA does.

A continuous dynamic topic model was developed by Wang et al. and applied to predict the timestamp of documents.^[3]

References

1. Blei, David M.; Lafferty, John D. (2006). "Dynamic topic models". Proceedings of the ICML. ICML '06: 113–120. ISBN 1-59593-383-2.
2. Rennie, Jason D. M. "Mixtures of Multinomials" (PDF). Retrieved 5 December 2011.
3. Wang, Chong; Blei, David; Heckerman, David (2008). "Continuous Time Dynamic Topic Models". Proceedings of ICML. ICML '08.
