CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

TL;DR

CEMTM uncovers coherent, interpretable topics from multimodal documents. It takes embeddings from a fine-tuned large vision–language model (LVLM) as contextualized token representations, and a distributional importance network weights each token's contribution. A reconstruction objective aligns topic-based representations with the LVLM-derived document embedding, promoting semantic consistency across text and images. Documents with multiple images are processed in a single pass. The model extracts explicit word-topic and document-topic distributions and improves coherence, diversity, and downstream retrieval performance across six diverse benchmarks, including long-form scientific and encyclopedic content, with strong few-shot QA benefits. Empirically, CEMTM achieves state-of-the-art topic quality and cross-domain generalization, and ablation analyses confirm the value of LVLM grounding and distributional supervision for interpretable multimodal topic reasoning.

Abstract

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
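
The weighting and reconstruction steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the linear scoring vector `importance_w`, the projection `topic_proj`, the topic embeddings `topic_embs`, and the mean-squared-error loss are all assumptions chosen for illustration, and a deterministic softmax stands in for the paper's distributional importance network. Random vectors replace real LVLM embeddings.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(rows, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r * v for r, v in zip(row, vec)) for row in rows]


def document_topics(token_embs, importance_w, topic_proj):
    """Importance-weighted average of per-token topic distributions.

    Assumption: importance is a linear score per token, normalized with a
    softmax over the document (a simplification of a distributional network).
    """
    scores = [sum(w * h for w, h in zip(importance_w, emb)) for emb in token_embs]
    alphas = softmax(scores)
    # Per-token topic distribution: softmax over K topic logits.
    token_topics = [softmax(matvec(topic_proj, emb)) for emb in token_embs]
    n_topics = len(topic_proj)
    theta = [
        sum(a * tt[k] for a, tt in zip(alphas, token_topics))
        for k in range(n_topics)
    ]
    return alphas, theta


def reconstruction_loss(theta, topic_embs, doc_emb):
    """Mean squared error between the topic-based reconstruction
    (theta-weighted sum of topic embeddings) and the document embedding."""
    dim = len(doc_emb)
    recon = [
        sum(theta[k] * topic_embs[k][i] for k in range(len(theta)))
        for i in range(dim)
    ]
    return sum((r - t) ** 2 for r, t in zip(recon, doc_emb)) / dim


# Toy data: random vectors stand in for LVLM token and document embeddings.
random.seed(0)
n_tokens, dim, n_topics = 5, 4, 3


def rand_vec(size):
    return [random.gauss(0, 1) for _ in range(size)]


token_embs = [rand_vec(dim) for _ in range(n_tokens)]
doc_emb = rand_vec(dim)
importance_w = rand_vec(dim)
topic_proj = [rand_vec(dim) for _ in range(n_topics)]
topic_embs = [rand_vec(dim) for _ in range(n_topics)]

alphas, theta = document_topics(token_embs, importance_w, topic_proj)
loss = reconstruction_loss(theta, topic_embs, doc_emb)
```

Because `alphas` sums to one and each per-token topic distribution sums to one, `theta` is itself a valid document-topic distribution, which is what keeps this kind of model interpretable. In training, the loss would be minimized with respect to the small trainable layers while the backbone embeddings stay fixed.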

Paper Structure

This paper contains 32 sections, 12 equations, 2 figures, 14 tables.

Figures (2)

  • Figure 1: Overall architecture of CEMTM. Articles containing both text and images are encoded using a fine-tuned vision–language model to produce contextualized embeddings. During training, only the decoder forward layer, encoder forward layer, and importance network are fine-tuned, while the underlying vision–language backbone remains frozen. The model learns to construct document topic vectors by weighting token embeddings through the importance network, with reconstruction loss guiding optimization.
  • Figure 2: LVLM Zero-shot TM uses LVLM embeddings for better multimodal alignment and more meaningful topic vectors than Multimodal Zero-shot TM.