Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Shuai Dong, Siyuan Wang, Xingyu Liu, Zhongyu Wei

TL;DR

ILVR introduces interleaved latent visual reasoning to bridge dynamic state evolution with precise perceptual modeling in multimodal reasoning, avoiding costly pixel-level re-encoding. A Momentum Teacher selectively distills relevant features from helper images to supervise latent representations at each reasoning step, enabling adaptive, context-aware visual cues. The two-stage learning regime first enforces latent alignment with perceptual signals and then relaxes supervision to end-to-end text generation, improving both grounding and reasoning flexibility. Empirical results across COMT, VSP, Zebra-CoT and out-of-distribution benchmarks show ILVR achieving state-of-the-art or near-state-of-the-art performance, with ablations confirming the importance of adaptive selection, latent size, and alignment weighting. Overall, ILVR demonstrates scalable, robust integration of fine-grained perception into evolving, interleaved multimodal reasoning.

Abstract

Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.
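
The abstract names a Momentum Teacher Model but does not spell out how it is updated. The sketch below assumes the common reading of the term: the teacher's weights are an exponential moving average (EMA) of the weights of the model being trained. The function name `ema_update`, the `momentum` value, and the dictionary-of-floats weight layout are illustrative assumptions, not details taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the matching student weight.

    Assumption: the Momentum Teacher is an EMA copy of the trained model.
    `teacher` and `student` map parameter names to floats (toy layout).
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy example with two scalar "parameters".
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1 the teacher changes slowly, which is the usual motivation for this design: the supervision targets stay stable while the trained model changes quickly.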

Paper Structure

This paper contains 26 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of ILVR with prior latent visual reasoning methods. On a multiple-choice chess puzzle (top row), prior approaches like LVR (a) are limited to representing static details of the initial input (e.g., a zoomed-in view of the rook), failing to model the hypothetical board states required to evaluate different options. On a dense counting task (bottom row), methods relying on heavily compressed latent representations (b) lose fine-grained details, resulting in a hallucinated count. In contrast, our proposed ILVR (c) successfully addresses both tasks by interleaving textual reasoning with dynamically updated latent states. Each latent representation provides specific visual cues essential for facilitating the subsequent reasoning step (visualized by red boxes), unifying dynamic evolution with precise perceptual modeling to arrive at the correct answer.
  • Figure 2: The Interleaved Latent Visual Reasoning (ILVR) framework. The model performs multi-step reasoning by interleaving textual generation with dynamic latent visual representations. Given a multimodal input, the Momentum Teacher Model (bottom) uses the current context and the history of latent representations to generate a Contextual Query ($q_m$), which selectively extracts the most relevant visual patches (yellow blocks) from a helper image. Simultaneously, the model being trained (top) generates a sequence of Latent Representations (pink blocks) interleaved with reasoning text. These generated latents are supervised via a Next-step Latent Alignment objective to match the Momentum-selected key visual features, enabling the model to ground its reasoning in precise, evolving visual evidence.
  • Figure 3: Visualization of dynamic latent modeling. Heatmaps depict the Gaussian-smoothed aggregation of relevant image patches for $K=8$ generated latents. Top: Latents sequentially track the character's path in navigation. Bottom: Visual attention shifts from the object (bread) to the target (plate) during robotic manipulation. This confirms precise alignment between generated latents and the step-wise reasoning context.
  • Figure 4: Impact of Latent Size $K$. Performance trends across VisualLogic, EMMA, and Zebra-CoT (and the overall average) as the number of latent tokens $K$ varies. $\lambda_{\text{sim}}$ is fixed at 1.0. $K=8$ yields the most robust performance across metrics.
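
The selection and alignment steps described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes cosine similarity for scoring patches against the contextual query $q_m$, a top-$K$ rule for choosing patches, and a mean $(1 - \text{cosine})$ alignment loss with latents paired to targets in ranked order. The function names and toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0


def select_top_k(query, patches, k):
    """Indices of the k patches most similar to the query, best first."""
    scores = [cosine(query, p) for p in patches]
    order = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def alignment_loss(latents, targets):
    """Mean (1 - cosine similarity) over paired latents and targets."""
    if len(latents) != len(targets):
        raise ValueError("latents and targets must have the same length")
    return sum(1.0 - cosine(z, t) for z, t in zip(latents, targets)) / len(latents)


# Toy helper-image patch features and a toy contextual query (assumed values).
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.1]

idx = select_top_k(query, patches, k=2)   # teacher side: pick K=2 patches
targets = [patches[i] for i in idx]       # sparse supervision targets
latents = [[0.9, 0.1], [0.6, 0.8]]        # student side: generated latents
loss = alignment_loss(latents, targets)
```

In a training loop, a term like `loss` would presumably be scaled by the alignment weight $\lambda_{\text{sim}}$ mentioned in Figure 4 and added to the text-generation loss; that combination is also an assumption here.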