MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
Dawei Yan, Yang Li, Qing-Guo Chen, Weihua Luo, Peng Wang, Haokui Zhang, Chunhua Shen
TL;DR
MMCR targets multimodal multi-turn contextual reasoning with two resources. MMCR-310k is a large-scale instruction-tuning dataset of 210k single-image and 100k multi-image dialogues (4 turns for single-image, 8 turns for multi-image), generated with GPT-4o and filtered with CLIP. MMCR-Bench is a 600-dialogue evaluation benchmark covering 8 domains and 40 subtopics, scored by GPT-4o on five contextual-reasoning dimensions. Fine-tuning VLMs (e.g., Ovis variants) with MMCR yields 5.2% higher contextual accuracy on MMCR-Bench and also improves results on public benchmarks (e.g., +1.1% on AI2D, +1.2% on MMMU and MMVet). The experiments further reveal a 'less is more' phenomenon: a balanced data distribution can outperform a larger, unbalanced dataset. Dataset construction uses OmniCorpus as the data foundation, GPT-4o-guided prompt engineering to generate dialogues, and CLIP-based filtering to keep images and dialogue aligned; MMCR-Bench is annotated with ImageNet labels and split by topic across the 40 subtopics. Together, these contributions support longer-context, coherent multimodal dialogue and provide a framework for evaluating contextual reasoning in VLMs. MMCR and the related prompts are planned for public release.
Abstract
Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. As training data, it also provides richer contextual reasoning information, guiding the model toward better performance. However, existing vision-language models (VLMs) rely primarily on single-turn dialogue for both training and evaluation. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k, the largest multi-image multi-turn instruction tuning dataset, with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench, a diagnostic benchmark of dialogues spanning 8 domains (Humanities, Natural, Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1% on AI2D, +1.2% on MMMU and MMVet). MMCR and the associated prompts will be released publicly.
