Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding

Songlin Li, Xin Zhu, Zechao Guan, Peipeng Chen, Jian Yao

Abstract

Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance supervision and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to improve distillation stability. Rather than relying on a single teacher response, our approach draws on a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, it filters teacher noise while maximizing knowledge transfer. Extensive evaluations across video understanding benchmarks show that R-MSD consistently outperforms single-sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).

Paper Structure (41 sections, 8 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Teacher sampling variance in video understanding. Left: A temporal QA example with four teacher responses and their temporal IoU to the ground truth, showing substantial quality variation across samples for the same question. Right: Global statistics over 200 teacher responses from eight task types: (a) cross-question variance and (b) within-question sampling variance. Both sources of variance undermine single-sample distillation and motivate our multi-sample approach.
  • Figure 2: Overview of the R-MSD framework (Stage 2: RL-based Adversarial Distillation). The pipeline utilizes a multi-sample teacher collection to provide diverse references. A critic-as-discriminator mechanism is employed to perform task-adaptive sampling, matching student rollouts with teacher responses. The student policy is then iteratively refined through adversarial distillation, using a weighted combination of discriminator, format, and task-specific scores as the reward signal.
  • Figure 3: Per-task teacher variance and Pass@$k$ behavior on Video-MMMU. Left: Teacher quality distribution by task type; box plots over 200 teacher responses across six task families highlight that high-stability closed-ended tasks (e.g., MCQ, numerical) have lower variance, while higher-variance tasks such as visual QA exhibit wider spreads. Right: Pass@$k$ accuracy on Video-MMMU as a function of $k$ (number of sampled student responses per query); our method achieves +3.2% higher Pass@1 accuracy than Qwen3-VL-4B while converging to a similar upper bound as $k$ increases.
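
The captions for Figures 1 and 3 refer to two quantities that are easy to state concretely: the temporal IoU between a predicted and a ground-truth time span, and Pass@$k$ accuracy. The sketch below is illustrative only and is not taken from the paper. The function names, the interval representation as `(start, end)` pairs in seconds, and the example numbers are all assumptions. For Pass@$k$ it uses the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ (with $n$ sampled responses, of which $c$ are correct); the source does not state which estimator the authors use.

```python
from math import comb
from statistics import pvariance


def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals given as (start, end).

    Hypothetical helper: the interval format is an assumption, not the paper's.
    """
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct ones.

    Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up example: four teacher answers to one temporal question.
    ground_truth = (10.0, 20.0)
    teacher_spans = [(10.0, 20.0), (15.0, 25.0), (0.0, 5.0), (12.0, 18.0)]
    ious = [temporal_iou(span, ground_truth) for span in teacher_spans]
    print("per-sample IoU:", [round(v, 3) for v in ious])
    # Spread of quality across samples of the same question.
    print("within-question variance:", round(pvariance(ious), 3))
    print("Pass@1 with 3 of 10 correct:", round(pass_at_k(10, 3, 1), 3))
```

In this toy setup, a single randomly drawn teacher answer could have an IoU anywhere between 0 and 1, which is the kind of within-question spread that Figure 1 attributes to single-sample supervision.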