Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

Jingchen Sun, Shaobo Han, Deep Patel, Wataru Kohno, Can Jin, Changyou Chen

Abstract

Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher–student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at https://github.com/Jingchensun/beta-kd.

Paper Structure (20 sections, 1 theorem, 23 equations, 6 figures, 6 tables)

Key Result

Theorem 1

Let the teacher-informed prior be a Gibbs distribution and assume $p(y\mid a^s,a^t,\beta)=p(y\mid a^s)$. Then maximizing the posterior $p(a^s\mid y,a^t,\beta)$ is equivalent to minimizing the knowledge distillation objective. In particular, since $a^s$ is deterministically induced by $\theta$ via $a^s=a^s(x;\theta)$, optimizing the student activation corresponds to minimizing this objective w.r.t. $\theta$.
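
The reasoning behind this equivalence can be sketched as follows. The sketch rests on an assumption not spelled out in this excerpt: that the Gibbs prior has the form $p(a^s\mid a^t,\beta)\propto\exp(-\beta\,E(a^s,a^t))$, where $E$ is a hypothetical energy measuring teacher–student discrepancy and $\beta$ is held fixed. Bayes' rule combined with the stated conditional independence then gives

$$
\begin{aligned}
p(a^s\mid y,a^t,\beta) &\propto p(y\mid a^s,a^t,\beta)\,p(a^s\mid a^t,\beta) = p(y\mid a^s)\,p(a^s\mid a^t,\beta),\\
-\log p(a^s\mid y,a^t,\beta) &= -\log p(y\mid a^s) + \beta\,E(a^s,a^t) + \text{const},
\end{aligned}
$$

where $\text{const}$ collects the terms that do not depend on $a^s$ (the prior's normalizer and the evidence $p(y\mid a^t,\beta)$). Under this reading, the first term plays the role of the data-fit loss and the second is a distillation term weighted by $\beta$; the paper's exact objective may differ in form.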

Figures (6)

  • Figure 1: Overview of the proposed Beta-KD framework. (a) Conventional KD struggles to balance learning from data against learning from teacher signals. (b) Our method introduces an uncertainty-aware weighting framework by treating teacher supervision as a Gibbs prior, which naturally induces the prediction of the weights $\beta_1$ and $\beta_2$ through an amortized optimization network. The predicted uncertainty weights dynamically modulate the strength of teacher–student alignment, enabling adaptive balancing without manual hyperparameter tuning.
  • Figure 2: Language modeling chain with teacher guidance. Given input $x$, the student network produces activations $\mathbf{f}^s$ or $\mathbf{z}^s$, which are mapped to probabilities $\mathbf{q}^s$ via softmax and then sampled to generate output $y$. Dashed arrows indicate teacher supervision injected at (1) the feature/logit level ($\mathbf{f}^t$ or $\mathbf{z}^t$) or (2) the probability level ($\mathbf{p}_t^\tau$).
  • Figure 3: Visualization of four representative knowledge distillation losses in the probability simplex.
  • Figure 4: Training trajectories and dynamic weight evolution for FKL+CE and RKL+CE objectives. The upper row shows the total training loss over steps, and the lower row illustrates the adaptive evolution of the task-level and instance-level uncertainty weights $\beta$. The adaptive adjustment of the weighting parameter $\beta$ during training yields faster overall loss convergence and enhances optimization stability.
  • Figure 5: Visualization of teacher–student logit distributions at different training stages. Step10 and Step190 denote early and late training checkpoints. At both checkpoints, Beta-KD (Task) and Beta-KD (Instance) reduce the logit matching distance relative to the baseline, with the instance-level variant achieving the closest alignment.
  • ...and 1 more figure
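
The captions for Figures 1, 2, and 4 describe a data loss (CE) combined with a teacher-alignment loss (for example FKL) on temperature-softened probabilities, with a weight $\beta$ controlling the balance. The following is a minimal sketch of that combination in plain Python, not the authors' implementation. The function names are hypothetical, and `beta` is passed in as a fixed scalar, whereas the paper predicts it with an amortized network.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Data term: negative log-probability of the ground-truth label."""
    q = softmax(student_logits)
    return -math.log(q[label])

def forward_kl(teacher_logits, student_logits, tau=1.0):
    """Teacher term: KL(p_teacher || q_student) on softened probabilities."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_kd_loss(student_logits, teacher_logits, label, beta, tau=1.0):
    """CE + beta * FKL. Here beta is a fixed scalar (an assumption of this
    sketch); it only illustrates how a larger beta shifts the balance
    toward teacher supervision."""
    data_term = cross_entropy(student_logits, label)
    teacher_term = forward_kl(teacher_logits, student_logits, tau)
    return data_term + beta * teacher_term
```

With `beta = 0` the loss reduces to plain cross-entropy, and increasing `beta` penalizes disagreement with the teacher more strongly. An uncertainty-aware scheme would replace the fixed scalar with a per-task or per-instance prediction.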

Theorems & Definitions (1)

  • Theorem 1: Energy–Bayes Equivalence