VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Zejun Li; Ruipu Luo; Jiwen Zhang; Minghui Qiu; Xuanjing Huang; Zhongyu Wei

TL;DR

VoCoT addresses the limits of single-step reasoning in large multi-modal models (LMMs) by introducing visually grounded, object-centric chain-of-thought reasoning. It couples VoCoT-formatted reasoning with RefBind-based object grounding and a three-stage training pipeline to produce VolCano, a 7B-parameter LMM that achieves state-of-the-art results on spatial and compositional benchmarks such as CLEVR and EmbSpatial. A dedicated VoCoT-Instruct-80K dataset enables instruction tuning for multi-step, visually grounded reasoning, and ablations demonstrate the importance of object-centric grounding and interleaved multi-modal pre-training. The work advances reliable, interpretable multi-modal reasoning, and its code, data, and models are publicly released.

Abstract

While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness on complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. To adapt LMMs to reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with prevalent open-source LMM architectures, we develop a VoCoT-based model, VolCano. With only 7B parameters and limited input image resolution, VolCano demonstrates excellent performance across various scenarios. On benchmarks such as CLEVR and EmbSpatial, which demand complex reasoning capabilities, VolCano outperforms SOTA models, including the powerful GPT-4V. Related code, data, and models are released at https://github.com/RupertLuo/VoCoT.

Paper Structure (69 sections, 13 figures, 18 tables)

This paper contains 69 sections, 13 figures, 18 tables.

Introduction
Visually-grounded Object-centric CoT
VoCoT Formulation
VoCoT-Instruct-80K Dataset
Type 1: GQA Source
Type 2: VQA-Based Source
Type 3: Image-Only Source
VolCano: A VoCoT-enhanced LMM
Architecture
Representations of Multi-modal Sequences
RefBind
Training
Stage 1: Alignment Pre-training
Stage 2: Multi-modal Interleaved Pre-training
Stage 3: Instruction Tuning
...and 54 more sections

Figures (13)

Figure 1: An example comparing different inference paradigms in LMMs. (a) A visual question that requires complex reasoning. (b) The conceptual object-centric reasoning path constructed to solve the problem. (c) Outputs of GPT-4V and the proposed VolCano. The output of GPT-4V contains hallucination. VolCano performs multi-step reasoning in the VoCoT format. In the reasoning path, key objects are highlighted and colors indicate the correspondence between object descriptions and the grounded regions in the image. "[box]" represents the coordinates of mentioned objects. Visual representations of objects are omitted for brevity.
Figure 2: Illustration of the VolCano framework. The input and output are shown below and above the model, respectively. The blue and green rounded rectangles represent textual and visual tokens, respectively. Special tokens "[c]" and "[/c]" denote the beginning and end of the coordinates ("[coor.]" in the figure). Coordinates are represented in text. In the output, we visualize coordinates by drawing the corresponding boxes in the image for better illustration. RefBind obtains the representations of objects from the image features and predicted coordinates.
Figure 3: Illustration of the RefBind mechanism.
Figure 4: Qualitative analysis to compare VoCoT and text-only CoT. Hallucinations are underlined.
Figure 5: Relationship between performance and the number of reasoning steps required by the questions.
...and 8 more figures
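
The caption of Figure 2 states that coordinates are written as text between the special tokens "[c]" and "[/c]", and that RefBind builds object representations from image features and predicted coordinates. The Python sketch below illustrates how such a string could be handled. Everything beyond the two tokens is an assumption made for illustration: the coordinate layout (four comma-separated values `x1,y1,x2,y2` normalized to [0, 1]), the hypothetical helper names `extract_boxes` and `cells_in_box`, and the rule of picking feature-grid cells whose centers fall inside a box. It is not the authors' RefBind implementation.

```python
import re
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]

# Only the "[c]" and "[/c]" tokens come from the Figure 2 caption;
# the comma-separated content between them is an assumption.
_COORD_PATTERN = re.compile(r"\[c\]\s*([^\[\]]+?)\s*\[/c\]")


def extract_boxes(text: str) -> List[Box]:
    """Return every box written between "[c]" and "[/c]" in a generated string."""
    boxes: List[Box] = []
    for match in _COORD_PATTERN.finditer(text):
        values = [float(v) for v in match.group(1).split(",")]
        if len(values) != 4:
            raise ValueError(f"expected 4 coordinates, got {len(values)}")
        boxes.append((values[0], values[1], values[2], values[3]))
    return boxes


def cells_in_box(box: Box, grid_h: int, grid_w: int) -> List[Tuple[int, int]]:
    """List (row, col) indices of feature-grid cells whose centers lie inside the box.

    Hypothetical selection rule: one simple way a predicted box could index
    a grid of image features.
    """
    x1, y1, x2, y2 = box
    cells: List[Tuple[int, int]] = []
    for row in range(grid_h):
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w  # normalized center of the cell
            cy = (row + 0.5) / grid_h
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                cells.append((row, col))
    return cells


# A made-up reasoning step in the assumed format.
step = (
    "The mug [c] 0.10,0.20,0.50,0.60 [/c] is left of "
    "the laptop [c] 0.55,0.25,0.95,0.70 [/c]."
)
boxes = extract_boxes(step)
selected = cells_in_box(boxes[0], grid_h=4, grid_w=4)
print(boxes)
print(selected)
```

Keeping coordinates as plain text means a regular expression is enough to recover the boxes from the generated sequence; the grid lookup is a separate step that needs only the box and the feature-map size.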
