Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Mingyu Zhang; Jiting Cai; Mingyu Liu; Yue Xu; Cewu Lu; Yong-Lu Li

TL;DR

This work reframes visual reasoning as a two-stage problem: symbolization (domain-specific grounding) followed by generic symbolic reasoning. It shows that a shared reasoner paired with task-specific encoders generalizes better across diverse domains than fully entangled or fully shared designs, supporting an approximation principle: training on multiple domains yields a stronger, cross-domain reasoner. Through extensive experiments across 2D puzzles, 3D intuitive physics, and VQA benchmarks, the authors demonstrate that a lightweight MLP-based reasoner with separated encoders achieves strong generalization and consistency, often outperforming more complex architectures and even some SOTA baselines. The study provides practical design principles, including optimal symbolization depth per task and multi-domain training strategies, paving the way for scalable, generalizable visual reasoning systems.

Abstract

Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets and thus lack generalization ability. Through rigorous evaluation on diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning with a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks that follow separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.
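
The separated-symbolization, shared-reasoning layout described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the single dense layers standing in for the encoders, the layer sizes, the domain names, and the class name `TwoStageModel` are all assumptions chosen to keep the example self-contained. It only shows the wiring, where each domain has its own encoder and every domain passes through the same MLP reasoner.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(layer, x):
    """Apply a dense layer to a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class TwoStageModel:
    """Hypothetical sketch: one encoder per domain, one shared MLP reasoner."""

    def __init__(self, domain_input_dims, symbol_dim=8, hidden_dim=16, n_out=2, seed=0):
        rng = random.Random(seed)
        # Stage 1 (symbolization): a separate encoder for each data domain.
        # A single dense layer stands in for a real visual encoder here.
        self.encoders = {
            name: make_linear(dim, symbol_dim, rng)
            for name, dim in domain_input_dims.items()
        }
        # Stage 2 (reasoning): one two-layer MLP shared by all domains.
        self.reasoner = [
            make_linear(symbol_dim, hidden_dim, rng),
            make_linear(hidden_dim, n_out, rng),
        ]

    def forward(self, domain, x):
        symbols = relu(linear(self.encoders[domain], x))   # domain-specific
        hidden = relu(linear(self.reasoner[0], symbols))   # shared
        return linear(self.reasoner[1], hidden)            # shared


# Two made-up domains with different input sizes share the same reasoner.
model = TwoStageModel({"puzzle_2d": 4, "physics_3d": 6})
print(model.forward("puzzle_2d", [1.0, 0.5, -0.2, 0.3]))
print(model.forward("physics_3d", [0.1] * 6))
```

Because the encoders map every domain to the same symbol dimension, the reasoner never needs to know which domain an input came from; that is the property the shared-reasoner design relies on.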

Paper Structure (28 sections, 3 equations, 9 figures, 14 tables)

Introduction
Related Work
Preliminary
Two Stages
Symbolization-Reasoning Framework
Entanglement v.s. Disentanglement
Symbolization Depth
Reasoner Architecture
Generalization of Reasoner
Experiments
Dataset and Setting
Entanglement v.s. Disentanglement Analysis
Optimal Symbolization Depth
One-for-All Reasoner Architecture
Approximation Principle Verification
...and 13 more sections

Figures (9)

Figure 1: Comparison between end-to-end model, human, and our framework. Previous works usually use a specific end-to-end model for each task, while our framework shares a logical reasoner similar to human intelligence.
Figure 2: Entanglement v.s. Disentanglement. Type 1: the symbol encoder and reasoner are all separated; Type 2: both the encoder and reasoner are shared; Type 3: only the encoder is shared; Type 4: only the reasoner is shared.
Figure 3: Probing process of symbolization. We vary the depths of the symbol encoder (ResNet) and train the framework while recording the accuracy at each encoder depth. An inflection point occurs in the curve at moderate depths.
Figure 4: "Approximation principle" verification with a shared reasoner. In step 1, one to four of the datasets SVRT, Bongard-HOI, CoPhy-Balls, and VQAv2 are selected to train the reasoner, giving 15 possible combinations. In step 2, the trained reasoner is tested on the CoPhy-Collision dataset.
Figure 5: Performance curve of varying encoder depth in symbolization stage. We present the results across RAVEN, CVR, SVRT, Bongard-LOGO, and Bongard-HOI. The highlighted points refer to the distinct inflection points.
...and 4 more figures
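
The count of 15 training configurations in the Figure 4 caption follows from choosing any non-empty subset of the four listed datasets (4 + 6 + 4 + 1). The short sketch below enumerates them; the helper name `training_subsets` is hypothetical and is only meant to make the arithmetic concrete.

```python
from itertools import combinations

# Dataset names as listed in the Figure 4 caption.
DATASETS = ["SVRT", "Bongard-HOI", "CoPhy-Balls", "VQAv2"]


def training_subsets(datasets):
    """Return every non-empty subset of `datasets`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(datasets) + 1)
        for combo in combinations(datasets, size)
    ]


subsets = training_subsets(DATASETS)
print(len(subsets))  # 4 + 6 + 4 + 1 = 15
for combo in subsets:
    print(", ".join(combo))
```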
