CompCap: Improving Multimodal Large Language Models with Composite Captions

Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He

TL;DR

This work addresses the gap in composite image (CI) understanding for Multimodal Large Language Models (MLLMs) by introducing CompCap, a universal framework that automatically synthesizes composite images paired with accurate and detailed captions. It produces CompCap-118K, a dataset spanning six CI types via six generation pipelines, and demonstrates that integrating this CI-caption data during supervised fine-tuning significantly improves CI comprehension across eleven benchmarks (average gains of 1.7–2.9 percentage points on 4B/7B/13B models). The study includes extensive ablations showing that each CI type and the use of caption data enhance vision-language alignment, with robust results on tasks dominated by CIs as well as tasks dominated by natural images (NIs). The work advances practical CI understanding for real-world visuals like collages, charts, diagrams, code, and tables, enabling more reliable reasoning and information extraction in multimodal settings.

Abstract

How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.

Paper Structure

This paper contains 45 sections, 1 equation, 33 figures, 9 tables.

Figures (33)

  • Figure 1: (a) CompCap implements image-caption synthesis pipelines for six composite image types. The composition of the curated CompCap-118K dataset is 42.3% Collage, 31.4% Image-Text, 18.7% Chart, 3.4% Table, 2.5% Diagram, and 1.7% Code. (b) Introducing CompCap-118K into the training data significantly improves MLLMs' performance on benchmarks comprising composite images.
  • Figure 2: MLLMs demonstrate poorer understanding of CIs compared to NIs. (a) Example of assessing the caption accuracy of MLLMs on a CI with the help of LLMs. (b) MLLMs generally understand NIs much better than CIs. (c) Errors generated during captioning are consistent with errors generated in VQA, highlighting the necessity of caption data for better vision-language alignment.
  • Figure 3: The CompCap Framework. The synthesis pipeline for different CI types implements CompCap differently.
  • Figure 4: The Collage implementation. We sample raw data from image-caption datasets and randomly generate a layout for the selected images. The images are then arranged into a collage following this layout, while an LLM generates a caption for the collage given both the layout details and captions of the individual images.
  • Figure 5: Ablation study of each CI category on LLaVA-NeXT-Vicuna-13B. We report the average scores over NI-dominated benchmarks $\mathbin{\Diamond}$($\mathbin{\blacklozenge}$) (SEEDBench, TextVQA, MMBench, MME, LLaVABench), CI-dominated benchmarks $\mathbin{\blacklozenge}$($\mathbin{\Diamond}$) (MathVista, OCRBench, ChartQA, DocVQA, InfoVQA, WebSRC), and all benchmarks. Baseline$^*$ stands for the original SFT data recipe in LLaVA-NeXT.
  • ...and 28 more figures
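
The collage pipeline described in the Figure 4 caption (sample captioned images, draw a random layout, then hand the layout details and per-image captions to an LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the grid-based layout, the `Cell` type, the function names `make_grid_layout` and `build_caption_prompt`, and the prompt wording are all assumptions made for illustration. The actual image compositing and the LLM call are left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Cell:
    """One collage slot (hypothetical type): grid position plus the source caption."""
    row: int
    col: int
    caption: str


def make_grid_layout(captions, rng):
    """Draw a random grid shape that fits all images and assign each image a cell.

    Assumption: a simple rows x cols grid; the paper's layouts may be richer.
    """
    n = len(captions)
    cols = rng.randint(1, n)          # random number of columns
    rows = -(-n // cols)              # ceiling division so every image fits
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(positions)            # random placement within the grid
    cells = [Cell(r, c, cap) for (r, c), cap in zip(positions[:n], captions)]
    cells.sort(key=lambda cell: (cell.row, cell.col))  # reading order
    return rows, cols, cells


def build_caption_prompt(rows, cols, cells):
    """Combine layout details and per-image captions into one LLM prompt string."""
    lines = [f"The collage is a {rows}x{cols} grid."]
    for cell in cells:
        lines.append(f"- row {cell.row + 1}, column {cell.col + 1}: {cell.caption}")
    lines.append("Write one detailed caption describing the whole collage.")
    return "\n".join(lines)


if __name__ == "__main__":
    rng = random.Random(0)
    rows, cols, cells = make_grid_layout(["a dog", "a cat", "a boat"], rng)
    print(build_caption_prompt(rows, cols, cells))
```

Passing the layout to the LLM as text, rather than asking a vision model to describe the rendered collage, keeps the caption grounded in known positions and known per-image captions, which is the property the Figure 4 caption emphasizes.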