Learning from All: Concept Alignment for Autonomous Distillation from Multiple Drifting MLLMs

Xiaoyu Yang, Jie Lu, En Yu

TL;DR

This work addresses concept drift in knowledge distillation (KD) from multiple drifting MLLMs by establishing a theoretical link between concept drift and the non-stationary reasoning dynamics of multiple teachers. It proposes autonomous preference optimization (APO) within a learn-compare-critique framework to learn from all teachers while suppressing drift-induced biases; the method includes a formal multi-stream drift model and a KL-based concept-alignment step. A new large-scale chest X-ray reasoning dataset, CXR-MAX, collects 170,982 reasoning trajectories from seven MLLMs on MIMIC-CXR to study multi-teacher dynamics. Empirical results show that APO improves consistency, robustness, and generalization, achieving a Top-1 accuracy of $0.76$ on MS-CXR-T (≈13% above the best baseline) and significant gains in diagnostic report generation metrics, while ablations confirm the centrality of APO in mitigating drift. The work advances drift-aware KD for domain-specific multimodal reasoning and provides public data and code to spur further research.

Abstract

This paper identifies a critical yet underexplored challenge in distilling from multimodal large language models (MLLMs): the reasoning trajectories generated by multiple drifting teachers exhibit concept drift, whereby their reasoning distributions evolve unpredictably and transmit biases to the student model, ultimately compromising its performance. To tackle this issue, we pioneer a theoretical connection between concept drift and knowledge distillation, casting the non-stationary reasoning dynamics from multiple MLLM teachers as next-token prediction of multi-stream reasoning trajectories. Guided by concept drift, we introduce the "learn, compare, critique" paradigm, culminating in autonomous preference optimization (APO). Under the active guidance of the teachers, the student model first learns and self-distils preferred thinking by comparing multiple teachers. It then engages in critical reflection over the drifting inference from teachers, performing concept alignment through APO, ultimately yielding a robust, consistent, and generalizable model. Extensive experiments demonstrate superior consistency, robustness, and generalization within knowledge distillation. We also contribute a large-scale dataset, CXR-MAX (Multi-teachers Alignment X-rays), comprising 170,982 distilled reasoning trajectories derived from publicly accessible MLLMs based on MIMIC-CXR. Our code and data are public at: https://anonymous.4open.science/r/Autonomous-Distillation/.
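To make the notion of disagreement among drifting teachers concrete, the sketch below measures the mean pairwise KL divergence between the next-token distributions of several teachers at a single decoding step. It is only an illustration under stated assumptions: the function names, the three-token vocabulary, and the probability values are hypothetical, and this is not the paper's APO objective or its formal drift model.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    # eps guards against log(0) when q assigns zero mass to a token.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_pairwise_kl(teacher_dists):
    """Average KL over all ordered teacher pairs; larger means more disagreement."""
    n = len(teacher_dists)
    if n < 2:
        raise ValueError("need at least two teacher distributions")
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    total = sum(kl_divergence(teacher_dists[i], teacher_dists[j]) for i, j in pairs)
    return total / len(pairs)


# Hypothetical next-token distributions from three teachers (illustrative values).
agreeing = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]
drifting = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]

print("agreeing teachers:", mean_pairwise_kl(agreeing))
print("drifting teachers:", mean_pairwise_kl(drifting))
```

Under these assumptions, identical teachers give a score of zero, while teachers whose distributions diverge give a strictly positive score. A signal of this kind could, in principle, flag trajectories with strong inter-teacher conflict.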

Paper Structure

This paper contains 20 sections, 11 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Transmission of Concept Drift behind Distillation of MLLMs
  • Figure 2: The main contributions of our method. (a) By formalizing the autoregressive inference of MLLM teachers as multi-stream next-token prediction through the lens of concept drift, we reveal that reasoning disturbances among teachers can propagate to the student via supervised pre-distillation, inducing unpredictable drift. (b) We propose autonomous preference optimization (APO), which uses reasoning trajectories carrying explicit conflicts and uncertainties from drifting MLLMs as negative samples and crystallized thinking obtained via self-distillation as positive signals. Driven by reinforcement learning, our model follows a "learn–compare–critique" paradigm to autonomously perform preference alignment, yielding a more robust and generalizable domain-specialized student model. (c) The evolution of the distributions at different stages. Multiple teacher models first map the task distribution to a student-amenable space; the student then learns from this space while simultaneously reflecting on inter-teacher drift, thus autonomously refining itself.
  • Figure :

Theorems & Definitions (2)

  • Definition 2.1
  • Definition 2.2