VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Walid Bousselham, Hilde Kuehne, Cordelia Schmid

TL;DR

VOLD tackles the challenge of transferring complex reasoning from text-only LLMs to vision-language models by proposing a two-stage pipeline that first aligns the student with the teacher via supervised fine-tuning on teacher-generated reasoning traces, then trains with a unified objective that combines Group Relative Policy Optimization and on-policy knowledge distillation. A reward-guided KL masking mechanism mitigates conflicts between RL exploration and distillation, enabling effective knowledge transfer using only text-based data. Empirically, VOLD achieves state-of-the-art performance on diverse multimodal reasoning benchmarks (e.g., MMMU-Pro, MathVision, MathVista, LogicVista) and ablations confirm the critical importance of cold-start policy alignment and the integrated RL+KD signal. The approach demonstrates that abundant text-based reasoning resources can substantially reduce the need for vision-based reasoning data while still delivering strong zero-shot visual reasoning capabilities.

Abstract

Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for effective transfer during the online training phase in this scenario, and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
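To make the combination of GRPO and on-policy distillation more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the `beta` weight, the `mask_threshold`, the choice of reverse KL (student relative to teacher), and in particular the masking rule (dropping the KL term for rollouts that already earned a reward) are all assumptions made for illustration. The summary above states only that a reward-guided KL mask is used, not how it is defined. The PPO-style clipping normally used in GRPO is omitted for brevity.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_kl(student_probs, teacher_probs):
    """Reverse KL(student || teacher) over one token's vocabulary distribution.

    Assumes teacher probabilities are strictly positive.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def combined_loss(rollouts, beta=0.1, mask_threshold=0.5):
    """Average RL + distillation loss over one group of on-policy rollouts.

    Each rollout is a dict with (all field names hypothetical):
      'reward'  : scalar task reward (sparse RL signal)
      'logprob' : summed student log-probability of the sampled tokens
      'student' : per-token student distributions
      'teacher' : per-token teacher distributions on the same tokens
    """
    advantages = group_relative_advantages([r["reward"] for r in rollouts])
    total = 0.0
    for rollout, adv in zip(rollouts, advantages):
        # Policy-gradient surrogate (no clipping in this sketch).
        pg_loss = -adv * rollout["logprob"]
        # Dense distillation signal: mean per-token KL against the teacher.
        kl = mean(
            token_kl(s, t)
            for s, t in zip(rollout["student"], rollout["teacher"])
        )
        # ASSUMPTION: mask the KL term when the rollout was already rewarded,
        # so distillation does not fight a successful exploratory trace.
        mask = 0.0 if rollout["reward"] >= mask_threshold else 1.0
        total += pg_loss + beta * mask * kl
    return total / len(rollouts)
```

Under this assumed masking rule, only unrewarded rollouts are pulled toward the teacher, while rewarded rollouts are shaped by the RL term alone. Both terms are computed from the same rollouts.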

Paper Structure

This paper contains 29 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Visual Reasoning Examples. (left) The base model fails the task due to a flawed geometric assumption. (center) The base model trained with SFT+RL on text only outlines a valid plan but uses an incorrect formula, leading to a wrong answer. (right) The model trained with SFT+RL and guided by on-policy distillation from a teacher LLM successfully navigates the problem. It demonstrates flexible reasoning by considering and then discarding a difficult approach in favor of a more direct and correct one.
  • Figure 2: VOLD training pipeline: VOLD is a two-stage process to instill reasoning capabilities into a student VLM using a text-only teacher. In Stage 1, the student's policy is aligned with the teacher's via SFT on a corpus of teacher-generated reasoning traces. In Stage 2, the student is trained with a unified on-policy objective that uses the same rollouts to compute both a sparse reward for RL (GRPO) and a dense distillation loss against the teacher. This combined signal enhances reasoning without requiring any vision-based reasoning data. At inference, the resulting student model can effectively reason over novel image-text prompts.
  • Figure 3: Learning dynamics: (left) Accuracy on the visual Geo3K dataset. (right) Reward on the text-only orz-57k training data. The results show a significant gain from using VOLD.
  • Figure 5: Training reward comparison: VOLD with KL masking (blue), without masking (purple), and vanilla GRPO (red). KL masking provides consistent performance gains throughout training.
  • Figure : (a) Geo3K validation accuracy
  • ...and 2 more figures