Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Shuai Dong; Siyuan Wang; Xingyu Liu; Zhongyu Wei

TL;DR

ILVR introduces interleaved latent visual reasoning to bridge dynamic state evolution with precise perceptual modeling in multimodal reasoning, avoiding costly pixel-level re-encoding. A Momentum Teacher selectively distills relevant features from helper images to supervise latent representations at each reasoning step, enabling adaptive, context-aware visual cues. The two-stage learning regime first enforces latent alignment with perceptual signals and then relaxes supervision to end-to-end text generation, improving both grounding and reasoning flexibility. Empirical results across COMT, VSP, Zebra-CoT and out-of-distribution benchmarks show ILVR achieving state-of-the-art or near-state-of-the-art performance, with ablations confirming the importance of adaptive selection, latent size, and alignment weighting. Overall, ILVR demonstrates scalable, robust integration of fine-grained perception into evolving, interleaved multimodal reasoning.

Abstract

Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.
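
The supervision strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the momentum value, the selection criterion, or the alignment loss, so the exponential-moving-average update, the cosine-similarity top-k selection, and the `1 - cosine` loss below are all assumptions, and every function name is hypothetical.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Assumed momentum rule: move teacher weights slowly toward the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def select_targets(patch_features, query, k):
    """Assumed selection rule: keep the k helper-image features closest to a query.

    `patch_features` stands in for teacher-encoded helper-image features and
    `query` for the current reasoning context; both are illustrative.
    """
    ranked = sorted(patch_features, key=lambda f: cosine(f, query), reverse=True)
    return ranked[:k]


def alignment_loss(latents, targets):
    """Assumed alignment loss: mean (1 - cosine) over paired latents and targets."""
    pairs = list(zip(latents, targets))
    if not pairs:
        raise ValueError("need at least one latent/target pair")
    return sum(1.0 - cosine(l, t) for l, t in pairs) / len(pairs)


if __name__ == "__main__":
    helper = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    targets = select_targets(helper, query=[1.0, 0.0], k=2)
    student_latents = [[0.9, 0.1], [0.8, 0.6]]
    print(targets)
    print(alignment_loss(student_latents, targets))
```

Under these assumptions, the selected features act as sparse targets for the latent tokens produced at one reasoning step, while the teacher tracks the student through the moving average rather than through gradient updates.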
