Scalable Dynamic Topic Modeling
November 15, 2022 Published by Federico Tomasi, Mounia Lalmas and Zhenwen Dai
Dynamic topic modeling is a well-established tool for capturing the temporal dynamics of the topics of a corpus. A limitation of current dynamic topic models is that they can only consider a small set of frequent words, both because of their computational complexity and because there is insufficient data for less frequent words. However, in many applications, the most important words that describe a topic are often the uncommon ones. For instance, in the machine learning literature, words such as “Wishart” are relatively uncommon, but they are strongly related to the posterior inference literature. In this work, we propose a dynamic topic model that captures the correlation between the topics associated with each document, the evolution of topic correlation, word co-occurrence over time, and word correlation.
This is especially important for textual documents, including spoken documents, because these may contain many uncommon words and be relatively short. In such cases, since standard topic models rely on word co-occurrence, they may fail to identify topics and may treat uncommon words as noise without appropriate preprocessing.
Dynamic topic modeling with word correlation
In the topic modeling literature, a topic is a probability distribution over the words of the vocabulary, and each document is represented as a mixture of multiple topics. The components depicted in the graphical model are: the topic probabilities (µ), the topic correlations (Σ), and the topic-word associations (β).
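To make these components concrete, here is a minimal sketch of how µ, Σ and β interact to generate a document, in the spirit of a correlated topic model (the dimensions and the random values are purely illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_length = 5, 1000, 50

# Topic prior: mean and covariance of the (logistic-normal) topic proportions.
mu = np.zeros(n_topics)                         # topic probabilities (µ)
Sigma = np.eye(n_topics)                        # topic correlations (Σ)
beta = rng.normal(size=(n_topics, vocab_size))  # topic-word associations (β), unnormalised

# Per-document topic proportions: draw from N(µ, Σ) and map through a softmax.
eta = rng.multivariate_normal(mu, Sigma)
theta = np.exp(eta) / np.exp(eta).sum()

# Topic-word distributions: softmax of β over the vocabulary.
word_probs = np.exp(beta) / np.exp(beta).sum(axis=1, keepdims=True)

# Each word: pick a topic from θ, then a word from that topic's distribution.
topics = rng.choice(n_topics, size=doc_length, p=theta)
words = np.array([rng.choice(vocab_size, p=word_probs[z]) for z in topics])
```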
In this work, we consider similar words as concretizations of a single concept. This removes the need for ad-hoc preprocessing (such as tokenization or stemming) to pre-group similar words into the same meaning. We achieve this through the use of multi-output Gaussian processes, which learn a latent embedding for groups of words related to each other. The latent embedding is denoted by H, and its number of components is independent of the size of the vocabulary. The dynamics of each word for each topic are instead modelled by β, which is conditioned on the latent embedding H.
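A rough sketch of this idea, not the exact model (the kernel, dimensions and mixing scheme are illustrative): a small number of latent temporal functions H is shared across the whole vocabulary, and each word's trajectory in β is a word-specific combination of those latent functions, so words with similar weights evolve together even if each of them is individually rare.

```python
import numpy as np

rng = np.random.default_rng(1)
n_times, n_latent, vocab_size = 30, 8, 10_000   # n_latent << vocab_size

# Squared-exponential kernel over time for the latent temporal functions.
t = np.linspace(0, 1, n_times)[:, None]
K = np.exp(-0.5 * (t - t.T) ** 2 / 0.1**2) + 1e-6 * np.eye(n_times)

# H: latent embedding, a few GP-distributed functions over time (T x L).
H = rng.multivariate_normal(np.zeros(n_times), K, size=n_latent).T

# W: per-word mixing weights (V x L); similar words get similar rows.
W = rng.normal(size=(vocab_size, n_latent))

# beta: per-word temporal trajectories for one topic (T x V),
# obtained by conditioning on the shared latent functions H.
beta = H @ W.T
```

The key point is that only the L latent functions carry temporal structure; each of the V words adds just a static set of mixing weights, independently of the vocabulary size.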
Evaluation
We compared our model, called MIST (Multi-output with Importance Sampling Topic model), to state-of-the-art topic models, both static (LDA and CTM) and dynamic (DETM and DCTM). We used a total of 7 datasets with different characteristics (e.g., the number of documents per time step and the time span covered).
The table below shows the average per-word perplexity on the test set for each dataset, measured as the exponential of the average negative ELBO (a lower bound on the log-likelihood) per word (the lower, the better). MIST outperforms the baselines on all datasets. Importantly, on the datasets with more than 10k words, DCTM, despite having the second-best performance on the other datasets, could not be trained without running into out-of-memory issues.
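For reference, a minimal sketch of the perplexity metric described above (the function and variable names are illustrative):

```python
import numpy as np

def per_word_perplexity(neg_elbo_per_doc, words_per_doc):
    """Perplexity as the exponential of the average negative ELBO per word."""
    total_neg_elbo = np.sum(neg_elbo_per_doc)   # sum of -ELBO over test documents
    total_words = np.sum(words_per_doc)         # total number of word tokens
    return np.exp(total_neg_elbo / total_words)
```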
Computational efficiency
We further investigated the difference in computational efficiency between DCTM and MIST. In DCTM, the words for each topic are modelled independently, hence the model needs to approximate every word for every topic. This is computationally inefficient and infeasible for large vocabularies. In MIST, we overcome the problem by modelling only a fixed number of latent variables. This change is reflected in the computational difference between the two models.
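As a back-of-the-envelope illustration (the numbers are made up for this sketch, not taken from the paper): modelling every word for every topic requires a number of temporal processes that grows with the vocabulary, whereas a fixed set of latent components does not.

```python
# Illustrative count of temporal processes to learn (not actual parameter counts).
n_topics, n_latent = 10, 16

for vocab_size in (10_000, 100_000, 1_000_000):
    per_word_per_topic = n_topics * vocab_size   # one process per word-topic pair
    fixed_latents = n_topics * n_latent          # fixed number of latent processes
    print(vocab_size, per_word_per_topic, fixed_latents)
```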
We computed the average time to run 5 epochs on a dataset with an increasing number of words, while keeping a fixed number of samples at each time point. MIST consistently outperforms DCTM across all vocabulary sizes. In particular, the computational benefit of MIST becomes more evident once the vocabulary reaches 100K (and 1M) words.
Modeling infrequent words
We also analyzed how well MIST models infrequent keywords compared to previous methods. We plot the per-year counts of three keywords (“Wishart”, “Kalman” and “Hebbian”) in one of the datasets (NeurIPS). All three words appear, on average, fewer than 100 times a year, which makes them infrequent, yet they are all clear indicators of machine learning and neuroscience subfields. We then compare the probabilities assigned to each word in the topic with the strongest connection to that word, as inferred by MIST, DCTM and DETM. We plot the posterior of μ in the topic inferred by MIST where the word is most prominent, and the posterior of β for each of the dynamic topic models we tested. MIST is the most accurate at modeling these words, capturing the general dynamics of each word in the dataset without overfitting to it.
For example, consider the word “Wishart”, corresponding to the Wishart distribution (and process). The word counts are very low (fewer than 50 occurrences each year). However, the word is highly indicative of the Bayesian inference topic, as the Wishart distribution is the conjugate prior for the inverse covariance matrix of a multivariate normal. Indeed, it is highly correlated with the much more common word “posterior”. MIST accurately models the increasing trend of such words over time (middle row). Conversely, DCTM and DETM, which model words independently, do not have enough data to accurately model its dynamics.
Some final words
Our model can efficiently handle large vocabularies through the use of multi-output Gaussian processes, sparse Gaussian process approximations, and an importance sampling procedure. Because the model uses learnt word embeddings to relate words to one another, it does not need to see a large number of word co-occurrences in each document, making it a strong candidate for analyzing short-text documents.
For more information, please check out our paper below and our previous blog post.
Efficient Inference for Dynamic Topic Modeling with Large Vocabularies
Federico Tomasi, Mounia Lalmas and Zhenwen Dai
UAI 2022