Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan
TL;DR
The paper tackles the limited use of concepts in multimodal large language models by introducing MMGiC, a dataset of multimodal multi-grained concept annotations, and a general autoregressive discrete MLLM framework that processes interleaved visual and textual tokens. It demonstrates that combining coarse image captions with fine-grained object/region annotations yields complementary grounding across granularities, improving vision-language alignment on 12 benchmarks. Empirical results show that appropriately paced data strategies, notably a curriculum where image-caption data precedes MMGiC data, yield meaningful gains (e.g., improvements on POPE and SEED-Bench) and that MMGiC can serve as both pre-training and instruction-fine-tuning data. The work highlights practical implications for grounding concepts more deeply and broadly, while outlining avenues for scaling annotations and expanding to additional VL tasks and abstract concepts. The model is trained with the autoregressive objective $\mathcal{L} = \sum_{i=1}^{|u|} \log P(u_i \mid u_1, \dots, u_{i-1})$ over a discrete VL vocabulary, using one framework shared across modalities.
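To make the autoregressive objective above concrete, here is a minimal Python sketch that sums log P(u_i | u_1, ..., u_{i-1}) over one interleaved token sequence. It is an illustration under stated assumptions, not the authors' implementation: the names `sequence_log_likelihood`, `next_token_probs`, and `uniform_model` are hypothetical, and the toy model simply assigns equal probability to every id in a small shared vocabulary.

```python
import math


def sequence_log_likelihood(tokens, next_token_probs):
    """Return the sum over i of log P(u_i | u_1, ..., u_{i-1}).

    tokens: list of token ids; text and visual ids are assumed to live
        in one shared discrete vocabulary.
    next_token_probs: hypothetical callable that maps a prefix (tuple of
        ids) to a dict {token_id: probability} for the next token.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        # Condition on everything before position i.
        probs = next_token_probs(tuple(tokens[:i]))
        total += math.log(probs[tok])
    return total


def uniform_model(prefix):
    """Toy stand-in for a trained model: uniform over 4 token ids."""
    return {t: 0.25 for t in range(4)}


# A short sequence mixing (assumed) text ids and visual ids.
log_lik = sequence_log_likelihood([0, 3, 1], uniform_model)
print(log_lik)
```

Training would maximize this quantity over the corpus (equivalently, minimize its negative); in practice the per-step distributions come from a neural network rather than a lookup function.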
Abstract
Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as the two data granularities complement each other in the breadth and depth of concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other under our structured template and a general MLLM framework. We explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks. For example, an appropriate combination of the two achieves absolute improvements of 3.95% and 2.34% over image-caption data alone on POPE and SEED-Bench, respectively. Code, data and models will be available at https://github.com/LooperXX/MMGiC.
