Training-Free Reasoning and Reflection in MLLMs

Hongchen Wei, Zhenzhong Chen

TL;DR

FRANK enables reasoning in multimodal LLMs without gradient updates by merging a vision-grounded MLLM and a reasoning-specialized LLM at every decoder layer. It rests on two insights: shallow layers emphasize perception while deep layers emphasize semantics, and task vectors from homologous fine-tuned models are nearly orthogonal, which permits a layer-wise closed-form fusion under neural tangent kernel (NTK) assumptions. The method introduces modality priors guided by per-layer attention to balance visual grounding and reasoning, yielding a fusion mechanism that is training-free yet interpretable. Experiments on MMMU and on math and vision benchmarks show strong gains: on MMMU, FRANK-38B reaches an accuracy of 69.2, outperforming the strongest baseline, InternVL2.5-38B, by 5.3 points and surpassing the proprietary GPT-4o. This work provides a practical, scalable route to enhancing multimodal intelligence without task-specific retraining.

Abstract

Recent advances in Reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have showcased impressive reasoning capabilities via reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive costs of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces the FRANK model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities, without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that the shallow decoder layers allocate more attention to visual tokens, while the deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight merging approach that combines a visual-pretrained MLLM with a reasoning-specialized LLM. To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into deep decoder layers while preserving visual grounding in shallow decoder layers. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate the effectiveness of our approach. On the MMMU benchmark, our model FRANK-38B achieves an accuracy of 69.2, outperforming the strongest baseline, InternVL2.5-38B, by 5.3 points and even surpassing the proprietary GPT-4o model. Our project homepage is at: http://iip.whu.edu.cn/frank/index.html
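
The idea of merging hierarchically, keeping visual grounding in shallow layers and adding reasoning in deep layers, can be illustrated with a minimal sketch. It is not the paper's Taylor-derived closed-form solution, which the abstract does not spell out. It assumes plain task arithmetic with a hypothetical exponentially decaying coefficient; the names `merge_layers` and `rate` and the flat-list weight layout are illustrative only.

```python
import math


def merge_layers(base, mllm, reasoner, rate=0.1):
    """Merge two fine-tuned models layer by layer (illustrative only).

    Each argument is a list with one flat list of weights per decoder layer.
    All three models are assumed to share the same base architecture.
    """
    merged = []
    for layer, (w_base, w_vis, w_rsn) in enumerate(zip(base, mllm, reasoner)):
        # Assumed prior: the weight on the visual task vector decays with depth,
        # so shallow layers stay close to the MLLM and deep layers move
        # toward the reasoning LLM.
        alpha = math.exp(-rate * layer)
        merged.append([
            b + alpha * (v - b) + (1.0 - alpha) * (r - b)
            for b, v, r in zip(w_base, w_vis, w_rsn)
        ])
    return merged
```

With this choice the first layer reproduces the MLLM weights exactly, and the contribution of the reasoning model grows smoothly with depth. The actual method replaces the fixed decay with coefficients derived in closed form.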

Paper Structure

This paper contains 35 sections, 45 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Non-reasoning MLLMs lack reasoning and reflection abilities, while reasoning LLMs are unable to perceive visual information. We propose a training-free, closed-form layerwise fusion method that combines visual perception and language reasoning strengths, substantially enhancing overall reasoning capability in multimodal settings.
  • Figure 2: Layer-wise visual attention of NVIL-15B. Each curve shows the average attention from a text token to all visual tokens across layers. Shallow layers assign significantly higher attention to visual tokens, while attention in deeper layers drops rapidly toward zero, indicating a shift from perception to language reasoning. This supports our use of an exponential decay prior in the fusion process.
  • Figure 3: Cosine similarity between task vectors of vision-finetuned (NVIL-15B) and reasoning-finetuned (DeepSeekDistil-Qwen2.5-14B) models at each decoder block. The task vector at each block is computed by flattening the weight deltas with respect to the base model. The similarity remains close to 0 across all layers, indicating strong near-orthogonality.
  • Figure 4: Average output length of FRANK on the MMMU benchmark, stratified by task difficulty.
  • Figure 5: Output examples from FRANK-8B and the non-reasoning baseline model Idefics3-8B. Here, <think> and </think> denote R1-like reasoning processes, while blue text indicates reflection tokens.
  • ...and 6 more figures
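
The per-block similarity described for Figure 3 can be sketched as follows. This is a minimal illustration in plain Python, assuming each block's weights have already been flattened into a list of floats; the function names are hypothetical and not taken from the paper's code.

```python
import math


def task_vector(finetuned, base):
    """Weight delta of a fine-tuned block relative to the shared base model."""
    return [f - b for f, b in zip(finetuned, base)]


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def blockwise_similarity(base_blocks, vision_blocks, reasoning_blocks):
    """One cosine similarity per decoder block, between the two task vectors."""
    return [
        cosine_similarity(task_vector(v, b), task_vector(r, b))
        for b, v, r in zip(base_blocks, vision_blocks, reasoning_blocks)
    ]
```

Values near 0 at every block would correspond to the near-orthogonality the figure reports; values near 1 would mean the two fine-tunings moved the weights in the same direction.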

Theorems & Definitions (2)

  • Definition 1: Layer-Wise Task Loss Difference, LTLD
  • Definition 2: Layer-Wise Average Loss Difference, LALD