Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

Jingchen Sun; Shaobo Han; Deep Patel; Wataru Kohno; Can Jin; Changyou Chen

Abstract

Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher–student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at https://github.com/Jingchensun/beta-kd.

Paper Structure (20 sections, 1 theorem, 23 equations, 6 figures, 6 tables)

Introduction
Related Work
KD in Multimodal LLMs.
Uncertainty-based Loss Balancing.
Uncertainty-Aware Knowledge Distillation
Preliminaries
A Bayesian View of Knowledge Distillation
Teacher-Informed Gibbs Prior.
MAP of Student Activation.
Laplace Approximation.
Amortized Optimization on $\beta$.
Experiments
Experiment Setup
The Design Space of Energy-Bayes KD
Effectiveness of Uncertainty Weighting
...and 5 more sections

Key Result

Theorem 1

Let the teacher-informed prior be a Gibbs distribution and assume $p(y\mid a^s,a^t,\beta)=p(y\mid a^s)$. Then maximizing the posterior $p(a^s\mid y,a^t,\beta)$ is equivalent to minimizing the knowledge distillation objective. In particular, since $a^s$ is deterministically induced by $\theta$ via $a^s=a^s(x;\theta)$, optimizing the student activation corresponds to minimizing this objective w.r.t. $\theta$.
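
The reasoning behind Theorem 1 can be sketched as a short derivation. This is an illustrative reconstruction, not the paper's own equation: it assumes the Gibbs prior has the standard form with an energy $E(a^s,a^t)$ measuring teacher–student discrepancy and a normalizer $Z(\beta,a^t)$; both symbols are notation introduced here.

```latex
\begin{align}
p(a^s \mid a^t, \beta)
  &= \frac{1}{Z(\beta, a^t)} \exp\!\big(-\beta\, E(a^s, a^t)\big)
  && \text{(assumed Gibbs prior)} \\
p(a^s \mid y, a^t, \beta)
  &\propto p(y \mid a^s)\, p(a^s \mid a^t, \beta)
  && \text{(Bayes' rule with } p(y\mid a^s,a^t,\beta)=p(y\mid a^s)\text{)} \\
-\log p(a^s \mid y, a^t, \beta)
  &= -\log p(y \mid a^s) + \beta\, E(a^s, a^t) + \log Z(\beta, a^t) + \text{const}
\end{align}
```

For fixed $\beta$, the last two terms do not depend on $a^s$, so the MAP estimate minimizes a data term $-\log p(y\mid a^s)$ plus a $\beta$-weighted distillation term $E(a^s,a^t)$, which has the shape of a conventional KD loss.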

Figures (6)

Figure 1: Overview of the proposed Beta-KD framework. (a) Conventional KD struggles to balance learning from data and learning from teacher signals. (b) Our method introduces an uncertainty-aware weighting framework by treating teacher supervision as a Gibbs prior, which naturally induces the prediction of the weights $\beta_1$ and $\beta_2$ through an amortized optimization network. The predicted uncertainty weights dynamically modulate the learning strength between teacher and student alignment, enabling adaptive balancing without manual hyperparameter tuning.
Figure 2: Language modeling chain with teacher guidance. Given input $x$, the student network produces activations $\mathbf{f}^s$ or $\mathbf{z}^s$, which are mapped to probabilities $\mathbf{q}^s$ via softmax and then sampled to generate output $y$. Dashed arrows indicate teacher supervision injected at (1) the feature/logit level ($\mathbf{f}^t$ or $\mathbf{z}^t$) or (2) the probability level ($\mathbf{p}_t^\tau$).
Figure 3: Visualization of four representative knowledge distillation losses in the probability simplex.
Figure 4: Training trajectories and dynamic weight evolution for FKL+CE and RKL+CE objectives. The upper row shows the total training loss over steps, and the lower row illustrates the adaptive evolution of task- and instance-level uncertainty weights $\beta$. The adaptive adjustment of the weighting parameter $\beta$ during training ensures faster overall loss convergence and enhances optimization stability.
Figure 5: Visualization of teacher–student logit distributions at different training stages. Step 10 and Step 190 denote early and late training checkpoints. Across training steps, both Beta-KD (Task) and Beta-KD (Instance) reduce the logit matching distance relative to the baseline, with the instance-level variant achieving the closest alignment.
...and 1 more figures
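
The weighting idea described in the captions of Figures 1 and 2 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes a forward-KL distillation term at the probability level, and it assumes the $\beta$-dependent normalizer contributes a regularizer of the form $-c\log\beta$ (as a Laplace-style approximation of a locally quadratic energy would suggest). The constant `c`, the function names, and the stationary-point formula `closed_form_beta` are all hypothetical and chosen for illustration only.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Data term: negative log-likelihood of the ground-truth label."""
    return -math.log(softmax(student_logits)[label])

def forward_kl(teacher_logits, student_logits, tau=1.0):
    """Distillation term: KL(teacher || student) on softened probabilities."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def beta_weighted_loss(student_logits, teacher_logits, label,
                       log_beta, c=1.0, tau=1.0):
    """Assumed objective: CE + beta * KD - c * log(beta).

    beta is parameterized through log_beta so it stays positive.
    The -c * log(beta) term is an assumption standing in for the
    prior's normalizer; it stops beta from collapsing to zero.
    """
    beta = math.exp(log_beta)
    ce = cross_entropy(student_logits, label)
    kd = forward_kl(teacher_logits, student_logits, tau)
    return ce + beta * kd - c * log_beta

def closed_form_beta(kd, c=1.0, eps=1e-8):
    """Stationary point of beta * kd - c * log(beta): beta = c / kd."""
    return c / max(kd, eps)

if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # hypothetical student logits
    teacher = [3.0, 0.0, -2.0]   # hypothetical teacher logits
    kd = forward_kl(teacher, student)
    beta = closed_form_beta(kd)
    print("KD term:", kd)
    print("beta at stationary point:", beta)
    print("loss:", beta_weighted_loss(student, teacher, 0, math.log(beta)))
```

Under this assumed form, a large distillation energy (strong teacher–student disagreement) yields a small stationary $\beta$, so the data term dominates, while close agreement yields a large $\beta$. An amortized variant would replace `closed_form_beta` with a small network that predicts `log_beta` per task or per instance.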

Theorems & Definitions (1)

Theorem 1: Energy–Bayes Equivalence
