Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan

TL;DR

The paper tackles the limited use of concepts in multimodal large language models by introducing MMGiC, a dataset of multimodal multi-grained concept annotations, and a general autoregressive discrete MLLM framework that processes interleaved visual and textual tokens. It demonstrates that combining coarse image captions with fine-grained object/region annotations yields complementary grounding across granularities, improving vision-language alignment on 12 benchmarks. Empirical results show that appropriately paced data strategies, notably a curriculum where image-caption data precedes MMGiC data, yield meaningful gains (e.g., improvements on POPE and SEED-Bench) and that MMGiC can serve as both pre-training and instruction-fine-tuning data. The work highlights practical implications for grounding concepts more deeply and broadly, while outlining avenues for scaling annotations and expanding to additional VL tasks and abstract concepts. The approach relies on an autoregressive objective $\mathcal{L} = \sum_{i=1}^{|u|} \log P(u_i \mid u_1, \dots, u_{i-1})$ over a token sequence $u$, trained with a discrete VL vocabulary and a framework shared across modalities.
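The autoregressive objective above can be illustrated with a minimal sketch. The function name, the four-entry toy vocabulary, and the per-step probability tables are all hypothetical and chosen only for illustration; in the actual framework the conditional distributions would come from the LLM, and training would typically maximize this quantity (equivalently, minimize its negative).

```python
import math
from typing import Sequence


def autoregressive_log_likelihood(
    step_probs: Sequence[Sequence[float]],
    tokens: Sequence[int],
) -> float:
    """Return sum_i log P(u_i | u_1, ..., u_{i-1}).

    step_probs[i] is assumed to be the model's distribution over the
    shared discrete vocabulary at position i, already conditioned on
    the preceding tokens; tokens[i] is the observed token id u_i.
    """
    if len(step_probs) != len(tokens):
        raise ValueError("need exactly one distribution per token")
    total = 0.0
    for probs, token_id in zip(step_probs, tokens):
        # Add the log-probability assigned to the token actually observed.
        total += math.log(probs[token_id])
    return total


# Toy interleaved sequence over a hypothetical shared vocabulary:
# ids 0-1 stand in for text tokens, ids 2-3 for discrete visual tokens.
tokens = [0, 2, 3]
step_probs = [
    [0.5, 0.25, 0.125, 0.125],  # P(u_1)
    [0.2, 0.2, 0.5, 0.1],       # P(u_2 | u_1)
    [0.1, 0.1, 0.3, 0.5],       # P(u_3 | u_1, u_2)
]
log_likelihood = autoregressive_log_likelihood(step_probs, tokens)
```

Because text and visual tokens share one vocabulary in this sketch, the same sum covers both modalities without any modality-specific term.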

Abstract

Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other under our structured template and a general MLLM framework. We explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks; e.g., their appropriate combination achieves 3.95% and 2.34% absolute improvements over image-caption data alone on POPE and SEED-Bench, respectively. Code, data, and models will be available at https://github.com/LooperXX/MMGiC.

Paper Structure

This paper contains 83 sections, 2 equations, 12 figures, 24 tables.

Figures (12)

  • Figure 1: Structured template (Left) and data example (Right) of MMGiC. Different colored text indicates template text, image placeholders, annotation placeholders and multi-grained concept annotations, respectively. Each image-text interleaved data sample will be tokenized into discrete tokens.
  • Figure 2: Illustration of our general MLLM framework. Only the LLM is loaded and partially fine-tuned during training.
  • Figure 3: Comparison of generated captions by MLLMs pre-trained with different data recipes. MMGiC(C), MMGiC(CLD) and MMGiC(CLDR) denote the $\{0, 2, 3\}$-th data recipes in Table \ref{tab:data_recipes}, respectively. The bottom right of each example shows associated label-description pairs from MMGiC.
  • Figure 4: Comparison of generated images by MLLMs pre-trained with different data recipes (Left) and image editing and multimodal in-context image synthesis examples (Right).
  • Figure 5: Analysis on $8$ dimensions of SEED-Bench-IMG. Left: the performance of MLLM-MMGiC trained with different-grained concept annotations from MMGiC. Right: corresponding case studies. CG, FG, and MG denote MLLMs trained with coarse-, fine-, and multi-grained concept annotations from MMGiC, respectively. ✅ denotes the ground truth; ❎ denotes incorrect prediction(s).
  • ...and 7 more figures