v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu

TL;DR

This work addresses visual grounding decay in multimodal reasoning by introducing v1, a lightweight point-and-copy mechanism that lets autoregressive models reference and copy continuous image embeddings during generation. To train and evaluate this capability at scale, the authors construct v1g, a 300K-trace dataset with fine-grained visual grounding annotations. Empirical results on MathVista, MathVision, and MathVerse show that dynamic visual referencing improves multimodal reasoning, especially on tasks that demand precise grounding of visual evidence. The approach is computationally efficient, requires only two extra heads, and generalizes across backbones, offering a practical path toward more grounded, interpretable multimodal reasoning. The work also outlines a data-generation and evaluation pipeline and proposes future extensions to other modalities and controllable generation.

Abstract

When thinking with images, humans rarely rely on a single glance: they revisit visual information repeatedly during reasoning. However, existing models typically process images only once and thereafter generate reasoning entirely in text, lacking mechanisms to re-access or ground inference in visual representations. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. In response, we introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream, ensuring that evolving hypotheses remain grounded in perceptual evidence. Crucially, our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model's reasoning. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Across various multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines, establishing point-and-copy as a practical mechanism for grounded reasoning. The model checkpoint and dataset are available at github.com/jun297/v1.

Paper Structure

This paper contains 51 sections, 5 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Pure text-based reasoning vs. v1 during inference. Our v1 can actively re-access visual context by pointing to and copying relevant image regions throughout the reasoning process.
  • Figure 2: Inference process of v1. At each step, the MLLM encodes the multimodal context and generation history into token representations. For the last token (e.g., "<region>"), (a) a copy head projects its representation and computes logits against image patch embeddings, (b) a language head produces logits over the vocabulary, and (c) the two are concatenated to form the final distribution. If a patch is chosen, its embedding is copied as the next token input, enabling v1 to reference image regions one patch at a time.
  • Figure 3: Attention dynamics during reasoning. (a) illustrates a gradual decrease in overall attention to the input image tokens, while (b) indicates that semantically important visual regions receive disproportionately low attention, suggesting inefficient grounding during reasoning.
  • Figure 4: Qualitative comparison on MathVision. v1's dynamic grounding helps solve both bar graph and spatial reasoning tasks, while LLaVA-CoT misinterprets visual content in both cases.
  • Figure 5: Comparison of attention to copy tokens vs. original visual tokens. Layer-wise sum of attention scores directed to copy tokens and their corresponding original visual input tokens from a v1 output on a MathVision example. Copy token intervals are highlighted in yellow.
  • ...and 4 more figures
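
The inference step described in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `point_and_copy_step`, the weight names `W_lm` and `W_copy`, the use of raw patch embeddings as keys, the vocabulary-first concatenation order, and greedy selection are all hypothetical choices made for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def point_and_copy_step(hidden, patch_embeds, W_copy, W_lm, token_embeds):
    """One greedy decoding step over a joint vocabulary + patch distribution.

    hidden:       (d,)   representation of the last token
    patch_embeds: (P, d) image patch embeddings (used here as keys; assumption)
    W_copy:       (d, d) copy-head projection (hypothetical name)
    W_lm:         (V, d) language-head weights (hypothetical name)
    token_embeds: (V, d) input embedding table for ordinary tokens
    """
    # (b) language head: logits over the text vocabulary
    vocab_logits = W_lm @ hidden
    # (a) copy head: project the hidden state, then score each image patch
    query = W_copy @ hidden
    copy_logits = patch_embeds @ query
    # (c) concatenate both logit sets into one distribution
    # (vocabulary first, patches second; the order is an assumption)
    probs = softmax(np.concatenate([vocab_logits, copy_logits]))

    choice = int(np.argmax(probs))  # greedy pick, for illustration only
    vocab_size = vocab_logits.shape[0]
    if choice >= vocab_size:
        # A patch was chosen: its embedding becomes the next input
        patch_idx = choice - vocab_size
        return "patch", patch_idx, patch_embeds[patch_idx], probs
    # A text token was chosen: look up its embedding as usual
    return "token", choice, token_embeds[choice], probs
```

Because the patch logits share one softmax with the vocabulary logits, pointing at an image region competes directly with emitting a word, and a selected patch is fed back as a continuous embedding rather than a discrete token id.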