Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model

Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

TL;DR

This work tackles the challenge of transferring knowledge from large foundation models to specialized, efficient downstream systems by bridging substantial gaps in capacity, architecture, distributions, and modalities. It introduces DiverseDistill, a general distillation framework featuring a Distillation Module with a Question Augmenter and an Answer Augmenter that enables interactive, knowledge-rich communication between a student and a diverse teacher committee including complementary teachers. The approach is evaluated on both recommendation and vision tasks, showing that complementary teachers and the DiverseDistill mechanism consistently outperform baselines and single-teacher setups, with ablations highlighting the benefits of diversity and the possibility of data-wise teacher selection to reduce cost. The findings suggest a practical path to harness foundation-model knowledge for specialized applications while mitigating architectural and distribution mismatches, enabling cost-efficient serving of high-performance downstream systems.

Abstract

Recent advancements in foundation models have yielded impressive performance across a wide range of tasks. Meanwhile, for specific applications, practitioners have been developing specialized application models. To enjoy the benefits of both kinds of models, one natural path is to transfer the knowledge in foundation models into specialized application models, which are generally more efficient for serving. Techniques from knowledge distillation may be applied here, where the application model learns to mimic the foundation model. However, specialized application models and foundation models differ substantially in capacity, employ distinct architectures, use different input features from different modalities, and are optimized on different distributions. These differences in model characteristics pose significant challenges for distillation methods. In this work, we propose creating a teaching committee comprising both foundation model teachers and complementary teachers. Complementary teachers possess model characteristics akin to the student's, aiming to bridge the gap between the foundation model and specialized application models for a smoother knowledge transfer. Further, to accommodate the dissimilarity among the teachers in the committee, we introduce DiverseDistill, which allows the student to understand the expertise of each teacher and extract task knowledge. Our evaluations demonstrate that adding complementary teachers enhances student performance. Finally, DiverseDistill consistently outperforms baseline distillation methods, regardless of the teacher choices, resulting in significantly improved student performance.

Paper Structure (18 sections, 2 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: DiverseDistill introduces distillation parameters, specifically a Question Augmenter and an Answer Augmenter. The Question Augmenter enables the student to understand each teacher and ask it tailored questions. The Answer Augmenter enables each teacher to reply with a task-oriented answer, which is used as the soft label in the distillation loss.
  • Figure 2: The design of the Distillation Module with the Question Augmenter and the Answer Augmenter. The Question Augmenter introduces a Teacher Embedding $E$ to model each teacher's expertise and generates questions $\mathbf{q}$ conditioned on $E$. The Answer Augmenter consists of $n$ MLPs, one for each teacher, and outputs a set of answers $\{a_i\}$, which are used as the distillation targets.
  • Figure 3: Illustration of the task regularizer. DiverseDistill applies losses to the teacher's predictions based on the questions and to the student's predictions based on the answers.
  • Figure 4: Visualization of the importance scores of 11 randomly sampled data points from MovieLens.
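
The Distillation Module described in the Figure 2 caption can be read as the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation. The captions only say that questions are conditioned on the Teacher Embedding $E$ and that the Answer Augmenter holds one MLP per teacher. Everything else here is hypothetical: the class and function names, all dimensions, forming a question from the student's hidden state concatenated with a teacher's embedding, feeding each answer MLP the teacher's representation concatenated with its question, and the softmax output.

```python
import math
import random

random.seed(0)

def linear(in_dim, out_dim):
    """Randomly initialised weights (out_dim x in_dim) and a zero bias."""
    w = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim

def apply_linear(layer, x):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

class DistillationModule:
    """Hypothetical sketch: Question Augmenter plus Answer Augmenter."""

    def __init__(self, n_teachers, student_dim, teacher_dim,
                 embed_dim, q_dim, hidden_dim, n_classes):
        # Teacher Embedding E: one vector per teacher (learnable in practice).
        self.E = [[random.gauss(0.0, 0.1) for _ in range(embed_dim)]
                  for _ in range(n_teachers)]
        # Question Augmenter (assumed to be a single linear layer):
        # maps [student_hidden ; E_i] to a question q_i.
        self.question_layer = linear(student_dim + embed_dim, q_dim)
        # Answer Augmenter: n MLPs, one per teacher (assumed two layers each).
        self.answer_mlps = [
            (linear(teacher_dim + q_dim, hidden_dim), linear(hidden_dim, n_classes))
            for _ in range(n_teachers)
        ]

    def questions(self, student_hidden):
        # One question per teacher, conditioned on that teacher's embedding.
        return [apply_linear(self.question_layer, student_hidden + e)
                for e in self.E]

    def answers(self, student_hidden, teacher_hiddens):
        # Each teacher's MLP turns (teacher representation, question) into a
        # soft label a_i that can serve as a distillation target.
        out = []
        qs = self.questions(student_hidden)
        for (l1, l2), t, q in zip(self.answer_mlps, teacher_hiddens, qs):
            h = relu(apply_linear(l1, t + q))
            out.append(softmax(apply_linear(l2, h)))
        return out

# Toy usage with two teachers; all values are made up for illustration.
module = DistillationModule(n_teachers=2, student_dim=4, teacher_dim=6,
                            embed_dim=3, q_dim=5, hidden_dim=8, n_classes=3)
student_hidden = [0.1, -0.2, 0.3, 0.0]
teacher_hiddens = [[0.5] * 6, [-0.5] * 6]
answers = module.answers(student_hidden, teacher_hiddens)
```

Here `answers` holds one probability vector per teacher, playing the role of the set $\{a_i\}$ from the caption. For brevity the sketch gives every teacher the same representation size, whereas teachers in a real committee would likely differ and need per-teacher input dimensions.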