
Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

Guangfu Guo, Xiaoqian Lu, Yue Feng, Mingming Sun

Abstract

Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, in which attention shifts selectively and sequentially from the most informative regions to secondary cues, we propose Structured Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning follows this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show consistent gains, validating structured and sequential visual cognition.
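The first step described above — scoring visual regions against the question and ordering them from primary to secondary cues — can be sketched as follows. This is a minimal, hypothetical illustration using cosine similarity as a stand-in saliency score; the paper's actual saliency map is learned end-to-end, and the function and variable names here are not from the paper.

```python
import numpy as np

def rank_regions_by_saliency(question_emb, region_feats):
    """Score each visual region against the question embedding and
    return region indices ordered from most to least salient.
    (Illustrative sketch only; SSV-CoT learns its saliency map.)"""
    q = question_emb / np.linalg.norm(question_emb)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    scores = r @ q                 # cosine similarity per region
    order = np.argsort(-scores)    # primary cues first, secondary cues later
    return order, scores[order]

# Toy example: 4 candidate regions with 8-dim features
rng = np.random.default_rng(0)
question = rng.normal(size=8)
regions = rng.normal(size=(4, 8))
order, scores = rank_regions_by_saliency(question, regions)
print(order)  # sequential injection order for chain-of-thought reasoning
```

The returned `order` would then dictate the sequence in which region tokens are injected during reasoning, mirroring the curriculum-like progression the abstract describes.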

Paper Structure

This paper contains 37 sections, 25 equations, 3 figures, 5 tables, and 1 algorithm.

Figures (3)

  • Figure 1: The baseline MLLM lacks visual analysis, while SSV-CoT integrates cues through structured sequential reasoning to reach the correct answer.
  • Figure 2: Overview of the SSV-CoT framework. The model first constructs question-aware structured visual regions, then performs sequential visual access during chain-of-thought reasoning to progressively integrate visual evidence and generate the final answer.
  • Figure 3: Comparison of Text-CoT and SSV-CoT on Qwen2-VL-7B for visual reasoning. Errors are shown in red. Only selected regions are displayed, and numbers indicate the token injection order.