PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang, Bin Qin, Zhenbo Luo, Jian Luan

TL;DR

This work proposes PatchCue, a patch-based visual cue paradigm designed to enhance the visual reasoning capabilities of VLMs. It shows that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.
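
To make the idea of "representing cues at the patch level" concrete, here is a minimal sketch of how a pixel-level bounding box could be snapped to a grid of fixed-size patches. It is an illustration under stated assumptions, not the authors' implementation: the function name `pixel_bbox_to_patch_bbox`, the default patch size of 28 pixels, and the (col0, row0, col1, row1) output convention are all hypothetical choices that do not come from the paper.

```python
import math


def pixel_bbox_to_patch_bbox(bbox, image_size, patch_size=28):
    """Snap a pixel box to the patch grid (hypothetical helper).

    bbox:       (x0, y0, x1, y1) in pixels, with an exclusive max corner.
    image_size: (width, height) in pixels.
    patch_size: side length of a square patch; 28 is an assumed value.

    Returns inclusive patch indices (col0, row0, col1, row1) of every
    patch that the pixel box touches.
    """
    width, height = image_size
    x0, y0, x1, y1 = bbox

    # Clamp the box to the image so indices stay inside the grid.
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box is empty after clamping")

    # Floor the min corner, ceil the max corner, then convert the
    # exclusive max edge to an inclusive patch index.
    col0 = int(x0 // patch_size)
    row0 = int(y0 // patch_size)
    col1 = math.ceil(x1 / patch_size) - 1
    row1 = math.ceil(y1 / patch_size) - 1
    return (col0, row0, col1, row1)


if __name__ == "__main__":
    # A 224x224 image has an 8x8 grid of 28-pixel patches.
    print(pixel_bbox_to_patch_bbox((30, 45, 100, 90), (224, 224)))
```

The coarse output illustrates why a patch-level target may be easier to learn than a pixel-level one: many slightly different pixel boxes map to the same patch box, so the model does not need exact spatial localization.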

Paper Structure

This paper contains 29 sections, 12 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Comparison of reasoning with different cue types: (a) Text-only: reasoning based solely on textual information; (b) Pixel-bbox: cues represented as precise pixel-level bounding boxes; (c) Pixel-point: cues indicated by single pixel points highlighting key regions; (d) Patch-bbox: cues represented as patch-level regions to capture localized visual information; (e) SFT training comparison shows that patch-based cues improve model performance more effectively than pixel-bbox or pixel-point cues.
  • Figure 2: Overview of PatchCue. We divide images into fixed-size patches in order to represent important regions as visual cues. During the model’s reasoning process, it is essential not only to identify which patches are relevant to the given question but also to accurately reference and integrate these cues throughout each reasoning step. This structured use of patch-level cues helps the model ground its intermediate reasoning in the visual content, improving both interpretability and overall performance.
  • Figure 3: Data Pipeline. Starting from the collected original data, we filter to obtain challenging samples. We then extract and ground the key visual cues in the images, and finally construct new reasoning sequences based on these cues.
  • Figure 4: Data Distribution. In the left figure, we show the distribution of the number of cues per sample, where most cue data are concentrated between 2 and 5 cues; in the right figure, we show the distribution of the proportion of cue regions, with the majority of samples having cue regions occupying less than 40% of the image.
  • Figure 5: Case Study. We compare the model’s outputs before and after PatchCue training. After training, the model can generate visual cues during reasoning, improving both its perception and the interpretability of its reasoning process.
  • ...and 5 more figures
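
The two statistics described in the Figure 4 caption (number of cues per sample and the proportion of the image occupied by cue regions) can be sketched as follows. This is a hedged illustration only: the helper name `cue_statistics`, the inclusive (col0, row0, col1, row1) patch-box convention, and the choice to count overlapping patches once (the union of all cue regions) are assumptions, since the caption does not state how the proportion is computed.

```python
def cue_statistics(patch_boxes, grid_cols, grid_rows):
    """Return (number of cues, fraction of grid patches covered).

    patch_boxes: list of inclusive patch boxes (col0, row0, col1, row1).
    grid_cols, grid_rows: size of the patch grid for the image.

    Assumption: patches shared by several cues are counted once, so the
    fraction is the area of the union of cue regions.
    """
    if grid_cols <= 0 or grid_rows <= 0:
        raise ValueError("grid dimensions must be positive")

    covered = set()
    for col0, row0, col1, row1 in patch_boxes:
        for row in range(row0, row1 + 1):
            for col in range(col0, col1 + 1):
                # Ignore any patch index that falls outside the grid.
                if 0 <= row < grid_rows and 0 <= col < grid_cols:
                    covered.add((row, col))

    fraction = len(covered) / (grid_cols * grid_rows)
    return len(patch_boxes), fraction


if __name__ == "__main__":
    # Two overlapping cues on an 8x8 patch grid.
    cues = [(1, 1, 3, 3), (3, 3, 4, 4)]
    print(cue_statistics(cues, grid_cols=8, grid_rows=8))
```

In this example the two cues cover 9 and 4 patches and share one, so 12 of the 64 patches are covered, which falls under the 40% threshold mentioned in the caption.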