From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty

TL;DR

ARMADA is an efficient cross-modal knowledge distillation framework that transfers knowledge from large vision-language models, including black-box models, to language-only models. It uses novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability.

Abstract

Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without a significant drop in performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning, and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, and LLaMA-{3B, 7B, 8B}. ARMADA achieves up to a 3.4% improvement on language understanding tasks and a 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.

Paper Structure (22 sections, 11 equations, 13 figures, 17 tables)

Figures (13)

  • Figure 1: A schematic diagram of ARMADA. We highlight an example from the NLU task, where the text-to-image teacher model is used to distil knowledge to a language-only student model. Through manifold and output alignment loss objectives, the aligner model orchestrates the knowledge transfer from teacher to student modality.
  • Figure 2: Commutative diagram between the manifold, output and auxiliary output spaces. The bold lines denote the functions already defined between the spaces. The dashed lines indicate the functions we want to establish in the proof of Proposition 2.
  • Figure 3: Distribution of the margin (referred to as improvement) between undistilled BERT-base and BERT-base distilled with ARMADA from a Stable Diffusion teacher model, for different values of $\alpha$, $\beta$ and $\gamma$. (A) $\alpha$ highlights the importance of output alignment, (B) $\beta$ highlights the importance of manifold alignment between the TS Aligner and the student, and (C) $\gamma$ highlights the importance of the auxiliary output in the combined loss objective.
  • Figure 4: Performance improvement of the distilled BERT-base model over the undistilled variant across all tasks for different $\mathcal{L}_{manifold}$ functions.
  • Figure 5: Distribution of improvement between undistilled and ARMADA-distilled LLaMA-3B for different values of $\beta$ and $\gamma$. A two-way ANOVA using an ordinary least squares (OLS) regression model confirms the statistical significance of both hyperparameters (p-values of $7 \times 10^{-5}$ and $5 \times 10^{-3}$ for $\beta$ and $\gamma$, respectively).
  • ...and 8 more figures