Chatting with Images for Introspective Visual Thinking

Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan

TL;DR

The paper tackles information loss in large vision-language models by introducing an interactive reasoning paradigm called "chatting with images" and ViLaVT, an LVLM equipped with a dedicated dynamic vision encoder. At each step, the model emits a triplet $s_t=(r_t,q_t,z_t)$ (an internal thought, a natural-language inquiry, and a set of target regions), crops and re-encodes the targeted regions to produce $f_t$, and feeds the result back to the language model to refine its reasoning. Formally, $f_0=\text{V}(\text{I}, \emptyset)$ and $f_t=\text{V}(\text{C}_t,q_t)$, where $\text{V}$ is the vision encoder, $\text{I}$ the input images, and $\text{C}_t$ the regions cropped at step $t$. The approach is trained in two stages: supervised fine-tuning on repurposed and synthesized trajectories, followed by reinforcement learning with GRPO using a reward that combines correctness and formatting. It reaches state-of-the-art performance on 5 of 8 benchmarks, with strong gains on multi-image and video-based spatial reasoning. This work improves cross-view grounding, preserves fine-grained visual details, and offers a scalable path toward introspective visual thinking in multimodal AI, with practical implications for high-resolution perception and complex spatial reasoning tasks.
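The iteration summarized above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `Step`, `chat_with_images`, and the `encode`, `next_step`, and `crop` callables are hypothetical names, and the toy stand-ins at the bottom replace the real vision encoder and language model with simple tuple and string operations.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom), hypothetical layout
Region = Tuple[int, Box]          # (index of the source image, box inside it)


@dataclass
class Step:
    """One triplet s_t = (r_t, q_t, z_t), plus an optional final answer."""
    thought: str                                          # r_t
    inquiry: Optional[str] = None                         # q_t
    regions: List[Region] = field(default_factory=list)   # z_t
    answer: Optional[str] = None                          # set when reasoning ends


def chat_with_images(images, encode: Callable, next_step: Callable,
                     crop: Callable, max_steps: int = 8):
    """Run the loop: f_0 = V(I, None), then f_t = V(C_t, q_t) per step."""
    features = [encode(images, None)]      # f_0: initial encoding, no inquiry
    steps: List[Step] = []
    for _ in range(max_steps):
        step = next_step(features, steps)  # language model emits s_t
        steps.append(step)
        if step.answer is not None:        # stop once an answer is produced
            return step.answer, features, steps
        crops = [crop(images[i], box) for i, box in step.regions]  # C_t
        features.append(encode(crops, step.inquiry))               # f_t
    return None, features, steps           # step budget exhausted


# Toy stand-ins (assumptions for illustration only).
def toy_encode(inputs, inquiry):
    return ("features", tuple(inputs), inquiry)


def toy_crop(image, box):
    return f"{image}@{box}"


def toy_policy(features, steps):
    if not steps:  # first turn: ask for a joint re-encoding of two regions
        return Step("compare both views", "Which chair is nearer the door?",
                    [(0, (0, 0, 10, 10)), (1, (5, 5, 20, 20))])
    return Step("enough evidence", answer="left")


answer, features, steps = chat_with_images(
    ["img0", "img1"], toy_encode, toy_policy, toy_crop)
```

Note that both crops are passed to `encode` in a single call together with the inquiry, which mirrors the joint, language-guided re-encoding described in the text; how the real encoder fuses text and vision tokens is not modeled here.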

Abstract

Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.

Paper Structure

This paper contains 37 sections, 5 equations, 16 figures, and 7 tables.

Figures (16)

  • Figure 1: A qualitative comparison of three reasoning paradigms on a multi-view spatial reasoning task. "Thinking about Images" (Left): A static LVLM relies on a one-time visual encoding, resulting in information loss. This leads to a flawed understanding of the objects' spatial relationships and an incorrect answer. "Thinking with Images" (Center): This method invokes an external tool to highlight salient information while re-encoding each image independently. However, this visual prompting technique lacks expressiveness and fails to convey the necessary cognitive intent, resulting in a flawed comparison. "Chatting with Images" (Right): Our model, in contrast, leverages language prompting. The generated inquiry expresses a high-level cognitive intent. This highly expressive, declarative prompt guides a joint visual re-encoding of both images, enabling a relational comparison at the feature level and leading to the correct inference.
  • Figure 2: Left: The iterative reasoning process of ViLaVT; Right: the architecture of the dynamic vision encoder. The "chatting with images" reasoning paradigm unfolds as: (1) Initial Encoding: All input images/frames are initially encoded independently into vision token embeddings. (2) Stepwise Reasoning: The language model generates a triplet $s_t=(r_t, q_t, z_t)$, i.e., an internal thought, a natural language inquiry for visual re-encoding, and a set of target regions. (3) Targeted Re-encoding: Our dynamic vision encoder (Right) takes the textual inquiry $q_t$ and the specified visual regions (cropped and upscaled from source images/frames) as input and employs a hybrid attention strategy to jointly process vision and text tokens, producing re-encoded vision token embeddings. (4) Iteration: These newly generated vision token embeddings are then passed back to the language model, enriching its context and enabling it to generate the next, more informed reasoning step ($s_{t+1}$). This iteration continues until a final answer is reached.
  • Figure 3: The two-stage training pipeline for ViLaVT, including supervised fine-tuning (SFT, Top), followed by reinforcement learning with the GRPO algorithm (RL, Bottom).
  • Figure 4: Vision encoder analysis across resolutions. Our full model exhibits increasingly large performance gains over ablations as resolution decreases, evidencing robustness to information loss.
  • Figure 5: Attention visualization on an HRBench-4K example.
  • ...and 11 more figures
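
The RL stage in Figure 3 uses GRPO with a reward that combines correctness and formatting. The sketch below shows how such a reward could feed GRPO's group-relative advantage, which normalizes each sampled response's reward by the mean and standard deviation of its group. The weighted-sum form of the reward, the weight `w_format`, and the function names are assumptions made for illustration; the summary does not state how the two terms are actually combined.

```python
import statistics
from typing import List


def reward(correct: bool, well_formatted: bool, w_format: float = 0.5) -> float:
    """Hypothetical combined reward: correctness plus a weighted format bonus."""
    return float(correct) + w_format * float(well_formatted)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r_i - mean) / (std + eps) within one group.

    Population standard deviation is used here; implementations differ on
    population vs. sample standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to the same prompt (toy values).
group = [
    reward(True, True),    # correct and well formatted
    reward(False, True),   # wrong but well formatted
    reward(False, False),  # wrong and badly formatted
    reward(True, True),
]
advantages = group_advantages(group)
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed; when every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.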