VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, Zhongyu Wei
TL;DR
VoCoT addresses the single-step reasoning limitation of large multi-modal models (LMMs) by introducing visually grounded, object-centric chain-of-thought reasoning. It couples VoCoT-formatted reasoning with RefBind-based object grounding and a three-stage training pipeline to produce VolCano, a 7B-parameter LMM that achieves state-of-the-art results on spatial and compositional benchmarks such as CLEVR and EmbSpatial. A dedicated dataset, VoCoT-Instruct-80K, supports instruction tuning for multi-step, visually grounded reasoning, and ablations indicate the importance of object-centric grounding and interleaved multi-modal pre-training. The work aims at more reliable and interpretable multi-modal reasoning, points to broader use of grounding in vision-language systems, and releases its code, data, and models publicly.
Abstract
While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness on complex tasks has been limited by the prevailing single-step reasoning paradigm. To address this, this paper proposes VoCoT, a multi-step, Visually grounded, object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. To adapt LMMs to reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with prevalent open-source LMM architectures, we develop a VoCoT-based model, VolCano. With only 7B parameters and limited input image resolution, VolCano demonstrates excellent performance across various scenarios. On benchmarks such as CLEVR and EmbSpatial, which demand complex reasoning capabilities, VolCano outperforms SOTA models, including the powerful GPT-4V. Related code, data, and models are released at https://github.com/RupertLuo/VoCoT.
