An Image is Worth $K$ Topics: A Visual Structural Topic Model with Pretrained Image Embeddings
Matías Piqueras, Alexandra Segerberg, Matteo Magnani, Måns Magnusson, Nataša Sladoje
TL;DR
The paper addresses the challenge of analyzing visual political content at scale by coupling pretrained image embeddings with a structural topic model, yielding a visual Structural Topic Model (vSTM). Images can exhibit mixed topic memberships, and topic prevalence is related to covariates via a logistic-normal prior. The generative model draws each image embedding $\bm{z}_i$ from a mixture of topic embeddings $\bm{\beta}_k$ with topic proportions $\bm{\theta}_i$, where $\bm{\theta}_i$ depends on covariates through $\bm{\Gamma}$ and a covariance structure $\bm{\Omega}_{\theta}$ with an LKJ prior. Inference uses mean-field variational methods with reparameterization and minibatching, which makes the analysis scalable. Quantities of interest include the posterior means of the topics and their covariate-driven prevalence. The authors apply the model to COP-related Twitter images encoded with CLIP, perform model selection (choosing $K=45$), and validate topic coherence through intrusion tasks involving human evaluators. The empirical results reveal distinct visual worlds by actor and stance, with interpretable topics and meaningful visual co-occurrence patterns, and demonstrate the framework's potential for multimodal and cross-platform research. The work highlights the importance of pretrained embeddings for political image analysis, discusses limitations such as embedding biases, and points to future directions including improved interpretability and open-source tooling.
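To make the generative story concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a Gaussian likelihood around the mixture mean $\sum_k \theta_{ik}\bm{\beta}_k$ (the summary does not state the likelihood), uses a full $K$-dimensional softmax rather than a reduced parameterization, and treats the covariance as a fixed input instead of placing an LKJ prior on it. The names `simulate_vstm`, `Sigma`, and `noise_sd` are hypothetical.

```python
import numpy as np

def softmax(eta):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    e = np.exp(eta - eta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def simulate_vstm(X, Gamma, Sigma, beta, noise_sd=0.1, seed=0):
    """Draw topic proportions and image embeddings (illustrative sketch).

    X:     (n, P) covariate matrix
    Gamma: (P, K) covariate effects on topic prevalence
    Sigma: (K, K) covariance of the logistic-normal prior (fixed here; assumption)
    beta:  (K, D) topic embeddings
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K, D = beta.shape
    # Logistic-normal prior: Gaussian draw centered on X @ Gamma, then softmax.
    eta = X @ Gamma + rng.multivariate_normal(np.zeros(K), Sigma, size=n)
    theta = softmax(eta)
    # Assumed Gaussian noise around the theta-weighted mixture of topic embeddings.
    z = theta @ beta + noise_sd * rng.standard_normal((n, D))
    return theta, z

# Toy usage: 5 images, an intercept plus one covariate, K=3 topics, D=8 dimensions.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5), rng.standard_normal(5)])
Gamma = rng.standard_normal((2, 3))
beta = rng.standard_normal((3, 8))
theta, z = simulate_vstm(X, Gamma, np.eye(3), beta)
```

Each row of `theta` lies on the simplex, so an image can load on several topics at once, and changing a covariate in `X` shifts the expected prevalence of each topic through `Gamma`.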
Abstract
Political scientists are increasingly interested in analyzing visual content at scale. However, the existing computational toolbox is still in need of methods and models attuned to the specific challenges and goals of social and political inquiry. In this article, we introduce a visual Structural Topic Model (vSTM) that combines pretrained image embeddings with a structural topic model. This combination has two important advantages over existing approaches. First, pretrained embeddings allow the model to capture the semantic complexity of images relevant to political contexts. Second, the structural topic model provides the ability to analyze how topics and covariates are related, while maintaining a nuanced representation of images as a mixture of multiple topics. In our empirical application, we show that the vSTM is able to identify topics that are interpretable, coherent, and substantively relevant to the study of online political communication.
