ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations
Xuecheng Wu, Jiaxing Liu, Danlei Huang, Yifan Wang, Yunyun Shi, Kedi Chen, Junxiao Xue, Yang Liu, Chunlin Chen, Hairong Dong, Dingkang Yang
TL;DR
ViC-Bench addresses the gap in evaluating Visual-Interleaved Chain-of-Thought by introducing free-style intermediate visual states and a three-stage evaluation across four tasks. It combines dedicated data pipelines, IVS generation with function calls, and the Incremental Prompting Information Injection strategy to dissect prompting factors and reasoning trajectories in 18 MLLMs. The results reveal meaningful ThinkGain when free-style IVS is present, but also susceptibility to Legality violations and model-specific failure modes, underscoring that current MLLMs vary greatly in how well they leverage visual feedback for grounded reasoning. The benchmark and its findings offer a foundation for advancing VI-CoT capabilities and for guiding the development of more grounded multi-modal reasoning systems for complex tasks that require dynamic visual reasoning.
Abstract
Visual-Interleaved Chain-of-Thought (VI-CoT) enables Multi-modal Large Language Models (MLLMs) to continually update their understanding and decision space based on step-wise intermediate visual states (IVS), much as a human would. It has demonstrated impressive success in various tasks and has driven advances in related downstream benchmarks. Despite this progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly distort the original thinking trajectories and thus fail to evaluate the models' intrinsic reasoning capabilities. More importantly, existing benchmarks do not systematically explore the factors through which IVS affects untamed reasoning performance. To address these gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks, i.e., maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting adaptive function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. In addition, we establish the Incremental Prompting Information Injection strategy to ablatively explore the prompting factors for VI-CoT. We conduct extensive evaluations of 18 advanced MLLMs, revealing key insights into their VI-CoT capability. ViC-Bench has been made publicly available on Hugging Face.
