Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin

TL;DR

This work tackles the problem of visual information dilution in long-context multi-modal reasoning by introducing Vision-aligned Latent Reasoning (VaLR). VaLR dynamically injects vision-aligned latent tokens before each Chain-of-Thought step and employs a two-stage curriculum with a representation-alignment objective (REPA) to align MLLM intermediate states with dense visual features from external vision encoders, including multi-encoder fusion. Empirically, VaLR delivers strong gains on 3D spatial reasoning and perception benchmarks, notably achieving a VSI-Bench average of 52.9% with multi-encoder alignment and demonstrating test-time scaling where performance grows with reasoning length, unlike prior approaches. The method is encoder-agnostic and data-efficient, offering a practical path to robust long-context multi-modal reasoning in vision-language tasks and agentic applications.

Abstract

Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems that require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain-of-Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning the intermediate embeddings of the MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves performance on VSI-Bench from 33.0% to 52.9%, a gain of 19.9 percentage points over Qwen2.5-VL.
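
The alignment idea in the abstract can be illustrated with a small sketch. The abstract states only that intermediate MLLM embeddings are aligned with vision-encoder features; the negative-cosine-similarity form of the loss, the linear projector, and the names `project` and `alignment_loss` are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(hidden, weight):
    """Hypothetical linear projector from MLLM hidden space to vision-feature space."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def alignment_loss(hidden_states, vision_features, weight):
    """Mean negative cosine similarity between projected hidden states and
    the corresponding vision-encoder features (assumed loss form)."""
    sims = [
        cosine_similarity(project(h, weight), f)
        for h, f in zip(hidden_states, vision_features)
    ]
    return -sum(sims) / len(sims)


# Toy data: two 2-D hidden states and an identity projector.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
aligned_targets = [[2.0, 0.0], [0.0, 1.0]]     # same directions as hidden
orthogonal_targets = [[0.0, 1.0], [1.0, 0.0]]  # perpendicular to hidden

loss_aligned = alignment_loss(hidden, aligned_targets, identity)
loss_orthogonal = alignment_loss(hidden, orthogonal_targets, identity)
print(loss_aligned, loss_orthogonal)
```

Under this assumed form, the loss is lowest when the projected hidden states point in the same direction as the vision features, regardless of their magnitudes. In training, a term of this kind would be added to the usual language-modeling loss.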

Paper Structure

This paper contains 27 sections, 9 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Overview of VaLR. Our framework, VaLR, generates vision-aligned latent tokens and language tokens throughout the reasoning process. (a) During latent token generation, the last hidden state of the MLLM becomes the input embedding for the next token prediction. (b) To train latent token generation, we align the intermediate features of the MLLM with pre-trained visual representations extracted from external vision encoders. Note that we do not use the external vision encoders at test time.
  • Figure 2: Reasoning length-wise analysis. We investigate the effect of reasoning length on model performance across different MLLMs. We report the hallucination rate on the MMhalu [sun2024mmhalu] benchmark and accuracy (%) on the MathVista [lu2023mathvista], MathVision [wang2024mathvision], and MMVP [tong2024mmvp] benchmarks. For MMhalu, lower is better. We observe that VaLR is the only method that exhibits consistent performance improvements as reasoning length increases, while remaining robust on long-horizon tasks.
  • Figure 3: Effect of Data Scalability. We investigate the effect of training data size and evaluate on the VSI-Bench, BLINK, and V$^*$ benchmarks. Results are reported at sample sizes of 10K, 50K, 100K, 200K, and 450K with a fixed number of iterations. The results show consistent and scalable performance improvements with increased data size across all benchmarks. Notably, VaLR converges more than 20x faster than the vanilla SFT model on the V$^*$ benchmark.
  • Figure 4: Comparison between methods using vision encoder features. We compare two methods using DINOv3 features: (a) using visual features as input visual tokens of the MLLM (green), and (b) aligning visual features with MLLM embeddings (red). We report accuracy (%) on the VSI-Bench, BLINK, and V$^*$ benchmarks.
  • Figure 5: Feature Visualization.
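
The generation pattern described in Figure 1(a) can be sketched as a toy loop. It is an illustration under stated assumptions, not the authors' implementation: `toy_forward`, `toy_decode`, and `toy_embed` are hypothetical stand-ins for the MLLM forward pass, the language-model head, and the token embedding table. The number of latent tokens per step, and the simplification of each reasoning step to a single text token, are also assumptions.

```python
def toy_forward(input_embedding):
    """Hypothetical stand-in for one MLLM forward step; returns a 'last hidden state'."""
    return [0.5 * x + 1.0 for x in input_embedding]


def toy_decode(hidden):
    """Hypothetical stand-in for the LM head: maps a hidden state to a text token."""
    return "step-%d" % round(sum(hidden))


def toy_embed(token):
    """Hypothetical stand-in for the token embedding table."""
    return [float(len(token)), 1.0]


def reason(start_embedding, num_steps, latents_per_step):
    """Interleave latent tokens and text tokens, one text token per step (toy)."""
    trace = []
    emb = start_embedding
    for _ in range(num_steps):
        # Latent phase: the last hidden state is fed back directly as the
        # next input embedding, so no text token is decoded here.
        for _ in range(latents_per_step):
            emb = toy_forward(emb)
            trace.append(("latent", emb))
        # Language phase: decode a token and feed its embedding back in.
        hidden = toy_forward(emb)
        token = toy_decode(hidden)
        trace.append(("text", token))
        emb = toy_embed(token)
    return trace


trace = reason([0.0, 0.0], num_steps=2, latents_per_step=3)
for kind, value in trace:
    print(kind, value)
```

The loop never calls an external vision encoder, which matches the caption's note that the encoders are used only during training.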