Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang

TL;DR

RAPID tackles the bottleneck in multi-modal reasoning by decoupling perception (MLLM-based captioning) from reasoning (external text-only LLMs), enabling rapid scaling of reasoning power without costly vision-language re-alignments. It introduces Visual Perception Optimization (VPO), a policy-gradient scheme that uses the external reasoner’s correctness as a reward to produce reasoning-aligned captions, and combines this with a GRPO-style objective to refine the perception module. Empirically, RAPID yields significant gains across diverse multi-modal reasoning benchmarks and enables an inference-time scaling paradigm where a single optimized MLLM can be paired with ever-stronger LLMs to improve performance without retraining. The approach also preserves general, non-thinking capabilities and offers an LLM-agnostic, plug-and-play pathway for continual improvement in multi-modal AI systems.

Abstract

Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. On the other hand, Multi-modal Large Language Models (MLLMs) still lag behind, hindered by their outdated internal LLMs. Upgrading these LLMs is often prohibitively expensive, as it requires retraining the complete vision-language alignment. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role to convert multi-modal inputs into detailed textual outputs that can be processed by any powerful, external, text-only LLM reasoner. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM based on the correctness of answers generated by the external reasoner to produce faithful and query-relevant captions. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining.
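The reward signal described above can be illustrated with a small sketch. This is not the authors' implementation: the `Reasoner` interface, the exact-match correctness check, the toy reasoner, and the mean/standard-deviation normalization (borrowed from the GRPO-style objective mentioned in the TL;DR) are all assumptions made for illustration, and the caption penalty referenced in the figure captions is omitted because its form is not given here.

```python
from statistics import mean, pstdev
from typing import Callable, List

# Hypothetical interface: a frozen text-only reasoner maps (caption, question)
# to an answer string. It never sees the image, only the caption.
Reasoner = Callable[[str, str], str]


def correctness_rewards(
    captions: List[str], question: str, gold: str, reasoner: Reasoner
) -> List[float]:
    """Reward 1.0 for each caption from which the reasoner answers correctly.

    Exact string match is an assumed stand-in for a verifiable answer checker.
    """
    return [
        1.0 if reasoner(caption, question).strip() == gold.strip() else 0.0
        for caption in captions
    ]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reasoner(caption: str, question: str) -> str:
    """Toy stand-in for an external LLM: it can only answer if the caption
    contains the detail the question depends on."""
    return "4" if "two pairs" in caption else "unknown"


# Captions that would be sampled from the MLLM for a single image and question.
captions = ["a photo of shapes", "two pairs of circles", "some circles"]
rewards = correctness_rewards(captions, "How many circles?", "4", toy_reasoner)
advantages = group_relative_advantages(rewards)
```

In this toy group only the second caption carries the detail the reasoner needs, so it receives a positive advantage and the other two receive negative ones. In a policy-gradient update, these advantages would weight the log-likelihood of each sampled caption under the MLLM, while the reasoner stays frozen.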

Paper Structure

This paper contains 46 sections, 4 equations, 23 figures, 8 tables.

Figures (23)

  • Figure 1: Comparisons on multi-modal reasoning benchmarks on average performance and total model size between RAPID-enhanced Qwen2.5-VL series of models and the other existing MLLMs. Check the detailed numerical results in Appendix \ref{app:teaser} and experimental settings in Sec. \ref{sec:exp_main}.
  • Figure 2: Comparisons between RAPID and existing alignment methods for reasoning MLLMs. For novel LLMs, existing methods (a) repeatedly conduct the compute-intensive alignment procedure, while (b) RAPID decouples the visual perception from text-only reasoning (Sec. \ref{sec:method_offload}) by learning to extract reasoning-aligned visual contexts with the proposed VPO algorithm (Sec. \ref{sec:method_caption}). Note that the caption penalty, as in Eq. \ref{eq:cap_penalty}, is omitted here for simplicity. The flame and snowflake icons indicate the models are trainable and frozen, respectively, during the process.
  • Figure 3: Comparison of the strategies for visual perception $O_p$.
  • Figure 4: Visual Perception Optimization (VPO) reinforces captions that induce correct reasoning results via reinforcement learning with verifiable rewards. Here we omit caption penalty for simplicity.
  • Figure 5: General benchmark results.
  • ...and 18 more figures