v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

Jiwan Chung; Junhyeok Kim; Siyeol Kim; Jaeyoung Lee; Min Soo Kim; Youngjae Yu

TL;DR

This work addresses visual grounding decay in multimodal reasoning by introducing v1, a lightweight point-and-copy mechanism that lets autoregressive models reference and copy continuous image embeddings during generation. To train and evaluate this capability at scale, the authors construct v1g, a dataset of 300K reasoning traces with fine-grained visual grounding annotations. Results on MathVista, MathVision, and MathVerse show that dynamic visual referencing improves multimodal reasoning, especially on tasks that demand precise grounding in visual evidence. The approach is computationally efficient, requires only two extra heads, and generalizes across backbones, offering a practical path toward more grounded, interpretable multimodal reasoning. The work also outlines a data-generation and evaluation pipeline and proposes future extensions to other modalities and controllable generation.

Abstract

When thinking with images, humans rarely rely on a single glance: they revisit visual information repeatedly during reasoning. However, existing models typically process images only once and thereafter generate reasoning entirely in text, lacking mechanisms to re-access or ground inference in visual representations. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. In response, we introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream, ensuring that evolving hypotheses remain grounded in perceptual evidence. Crucially, our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model's reasoning. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Across various multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines, establishing point-and-copy as a practical mechanism for grounded reasoning. The model checkpoint and dataset are available at github.com/jun297/v1.
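The point-and-copy step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption in the style of a pointer network: the decoder's hidden state is scored against each image patch embedding (used as a key), the pointing scores are concatenated with the ordinary vocabulary logits, and if a patch wins, its embedding is copied back as the next input. The function names (`point_or_generate`, `next_input_embedding`), the scaled dot-product scoring, the greedy choice, and the omission of any learned projection heads are all hypothetical simplifications, not the authors' implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def point_or_generate(hidden, vocab_logits, patch_embeddings):
    """One greedy decoding step over a joint space of tokens and image patches.

    Assumption: pointing scores are scaled dot products between the hidden
    state and the patch embeddings, which act as keys. Returns a pair
    (choice, probs) where choice is ("token", id) or ("patch", index).
    """
    scale = math.sqrt(len(hidden))
    pointer_logits = [dot(hidden, key) / scale for key in patch_embeddings]
    # A single softmax over [vocabulary ; patches] makes the two compete.
    probs = softmax(list(vocab_logits) + pointer_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if best < len(vocab_logits):
        return ("token", best), probs
    return ("patch", best - len(vocab_logits)), probs


def next_input_embedding(choice, token_embeddings, patch_embeddings):
    """Copy step: feed back either a token embedding or the pointed patch."""
    kind, index = choice
    table = token_embeddings if kind == "token" else patch_embeddings
    return list(table[index])


if __name__ == "__main__":
    # Toy sizes: 3 vocabulary tokens, 2 image patches, hidden size 2.
    hidden = [1.0, 0.0]
    vocab_logits = [0.1, 0.2, 0.0]
    patches = [[0.0, 1.0], [4.0, 0.0]]
    tokens = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

    choice, _ = point_or_generate(hidden, vocab_logits, patches)
    print(choice)
    print(next_input_embedding(choice, tokens, patches))
```

In this toy setup the second patch aligns with the hidden state, so the step points at it and its embedding, rather than a text token embedding, becomes the next input. Because the copied vector is the patch's own continuous embedding, the perceptual evidence stays in the same space as the rest of the reasoning stream, which is the property the abstract emphasizes.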
