What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

Xirui Li, Ming Li, Tianyi Zhou

TL;DR

This paper investigates what reinforcement learning with verifiable rewards actually improves in visual reasoning for vision-language models. It introduces a Frankenstein-style analysis that combines causal functional localization, parameter-update geometry, and region-wise model merging to attribute RL gains to refinements in mid-to-late transformer layers. The authors find that RL shifts inference-time computation mainly in mid-to-late layers and improves vision-to-reasoning alignment and reasoning, rather than uniformly boosting visual perception. These refinements transfer across models (via merging) and are necessary for the gains (via freezing). The work highlights the limits of benchmark-only evaluation for multimodal reasoning and provides a framework for diagnosing the internal changes that drive progress.

Abstract

Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge this gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Our analysis indicates that RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
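
The transferability test in (iii) can be pictured as assembling a hybrid model whose layers come from different checkpoints. The following is a minimal sketch of that idea, not the authors' implementation: the equal-thirds split into early, mid, and late regions, the `Weights` representation (layer index to a flat list of parameters), and the names `split_regions` and `merge_regions` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: layer index -> flattened parameters of that layer.
Weights = Dict[int, List[float]]


def split_regions(num_layers: int) -> Dict[str, range]:
    """Split layer indices 0..num_layers-1 into three contiguous regions.

    Equal thirds are an assumption; the excerpt does not state the boundaries.
    """
    third = num_layers // 3
    return {
        "early": range(0, third),
        "mid": range(third, 2 * third),
        "late": range(2 * third, num_layers),
    }


def merge_regions(in_model: Weights, rl_model: Weights,
                  take_from_rl: Sequence[str]) -> Weights:
    """Build a hybrid model: RL weights in the named regions, IN weights elsewhere.

    Assumes both models use layer keys 0..N-1 with matching shapes.
    """
    regions = split_regions(len(in_model))
    rl_layers = {i for name in take_from_rl for i in regions[name]}
    return {
        i: list(rl_model[i] if i in rl_layers else in_model[i])
        for i in in_model
    }


# Toy example: 6 layers, IN weights are 0.0 and RL weights are 1.0.
in_model = {i: [0.0] for i in range(6)}
rl_model = {i: [1.0] for i in range(6)}
merged = merge_regions(in_model, rl_model, ["mid", "late"])
merged_flags = [merged[i][0] for i in range(6)]
```

Evaluating such a hybrid against the pure IN and pure RL models is one way to ask whether the mid-to-late refinements carry the gains on their own. The complementary necessity check described in the abstract (freezing a region during RL) would need a training loop and is not sketched here.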

Paper Structure (56 sections, 6 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Frankenstein-style Analysis Framework. The framework proceeds through three components: (1) functional localization via causal probing across transformer depth, (2) update characterization via parameter comparison to identify region-wise update patterns in post-training, and (3) transferability test via model merging, assessing whether the localized functionalities are preserved in layers.
  • Figure 2: Average Benchmark Accuracy versus Fine-Grained Abilities (Vision, Reasoning, and Vision-to-Reasoning Alignment). The green arrows denote model post-training pipelines (Base Model $\rightarrow$ IN Model $\rightarrow$ RL Model) that exhibit monotonic performance gains, whereas the purple arrows indicate model groups that do not. Despite apparent improvements on visual reasoning benchmarks, fine-grained evaluation metrics reveal that vision ability and reasoning ability do not improve monotonically from the base model to the IN model and then to the RL model.
  • Figure 3: Aggregated Attention from Reasoning Tokens to Vision Tokens. Compared to IN models, there is more attention from reasoning tokens to vision tokens in RL models' inference. The pattern is concentrated in later layers across training recipes, while absent in earlier layers.
  • Figure 4: Layer-wise functional localization of vision and reasoning. Both plots indicate the relative importance of each layer for the evaluated task. Vision-related functionalities are primarily associated with Early and Mid transformer layers, whereas reasoning-related computations are concentrated in Late layers.
  • Figure 5: Layer-wise comparison of parameter update norms between IN and RL. Per-layer Frobenius norms of parameter updates for IN (solid) and RL (dashed). Both training stages concentrate optimization in the Mid layers, while RL exhibits a distinct redistribution of update magnitude compared to IN.
  • ...and 4 more figures
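
The per-layer Frobenius norms described for Figure 5 can be computed by subtracting a checkpoint's weights before a training stage from its weights after that stage, layer by layer. The sketch below assumes each layer is a single matrix stored as nested lists; the `Matrix` alias and the function names are hypothetical, and a real model would sum over all parameter tensors in a layer.

```python
import math
from typing import Dict, List

# Hypothetical representation: one weight matrix per layer, as nested lists.
Matrix = List[List[float]]


def frobenius_norm(m: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def layer_update_norms(before: Dict[int, Matrix],
                       after: Dict[int, Matrix]) -> Dict[int, float]:
    """Frobenius norm of (after - before) for every layer index."""
    norms = {}
    for layer, w_before in before.items():
        w_after = after[layer]
        delta = [
            [a - b for a, b in zip(row_after, row_before)]
            for row_after, row_before in zip(w_after, w_before)
        ]
        norms[layer] = frobenius_norm(delta)
    return norms


# Toy example: layer 0 changes, layer 1 is untouched.
before = {0: [[0.0, 0.0], [0.0, 0.0]], 1: [[1.0, 1.0], [1.0, 1.0]]}
after = {0: [[3.0, 0.0], [0.0, 4.0]], 1: [[1.0, 1.0], [1.0, 1.0]]}
norms = layer_update_norms(before, after)
```

Running this once for the Base-to-IN update and once for the IN-to-RL update would give two curves over layer index that can be compared in the way the Figure 5 caption describes.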