Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue, Hanwang Zhang

TL;DR

This paper proposes Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. Extensive experiments show that NV-CoT significantly improves localization precision and final answer accuracy while also accelerating training convergence.

Abstract

Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates, which cause modality mismatch and semantic fragmentation, or fixed-granularity patches, which both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available at https://github.com/kesenzhao/NV-CoT.
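
The continuous-action idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalized `(x1, y1, x2, y2)` box format, the fixed `sigma`, the IoU reward, and the unclipped policy-gradient surrogate are all assumptions made for illustration. It shows a diagonal Gaussian policy over four box coordinates, reparameterized sampling (`box = mu + sigma * eps`), and group-standardized advantages of the kind used in GRPO-style optimization.

```python
import math
import random


def sample_box(mu, sigma, rng):
    """Reparameterized sample: box = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def gaussian_log_prob(box, mu, sigma):
    """Log-density of a diagonal Gaussian evaluated at `box`."""
    return sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for x, m, s in zip(box, mu, sigma)
    )


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


rng = random.Random(0)
mu = [0.20, 0.30, 0.60, 0.70]      # hypothetical predicted box mean
sigma = [0.05, 0.05, 0.05, 0.05]   # hypothetical per-coordinate std
target = [0.25, 0.30, 0.65, 0.75]  # hypothetical reference box

# Sample a group of boxes from the policy and score each one.
group = [sample_box(mu, sigma, rng) for _ in range(8)]
rewards = [iou(box, target) for box in group]  # assumed reward, for illustration
advantages = group_relative_advantages(rewards)
log_probs = [gaussian_log_prob(box, mu, sigma) for box in group]

# Simplified surrogate loss: advantage-weighted negative log-likelihood.
surrogate_loss = -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(group)
```

A Laplace policy, which the abstract also mentions, would change only the sampling and log-density functions. A full GRPO objective typically adds a clipped probability ratio and a KL penalty, both omitted here for brevity.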

Paper Structure (21 sections, 17 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Our NV-CoT outperforms text-based visual CoT models (Vis-CoT [shao2024visual] and DeepEyes [zheng2025deepeyes]) in localization precision, answer accuracy, and convergence speed across both SFT and RL. SFT-based models are evaluated on the Vis-CoT-363K dataset [shao2024visual], where ground-truth bounding boxes are available, while RL-based models are evaluated on the DeepEyes-47K dataset [zheng2025deepeyes]. We only replace the text-space discrete coordinate objective with our Euclidean-space continuous one, while keeping all other training configurations unchanged for a fair comparison.
  • Figure 2: Comparison of different paradigms for thinking with images. (a) Text-based approaches represent localized regions as discrete coordinate tokens, leading to modality mismatch and fragmented semantics. (b) Patch-based approaches reason directly over fine-grained visual tokens but are constrained by the fixed spatial granularity of the vision backbone. (c) Our NV-CoT predicts region coordinates in continuous space, enabling flexible and precise localization.
  • Figure 3: Behavior of $\alpha$. Successful trajectories exhibit smaller $\alpha$ than failed ones, reflecting higher confidence.
  • Figure 4: Effect of $\lambda$. Performance peaks at $\lambda=0.3$ and NV-CoT consistently outperforms the baseline across all values.
  • Figure 5: Visualization of bounding boxes. NV-CoT produces more accurate bounding boxes (shown in red) compared to the backbone model (shown in blue), demonstrating improved localization capability.
  • ...and 1 more figure