When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?

Srikrishna Iyer

TL;DR

Deep mutual learning treats all students equally. This work removes that limitation by formulating weighted mutual learning as a bi-level optimization problem, which also eliminates the need for a teacher model and reduces computational requirements.

Abstract

We present our submission to the BabyLM challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds upon deep mutual learning, introducing a student model search for diverse initialization. We address the limitation of treating students equally by formulating weighted mutual learning as a bi-level optimization problem. The inner loop learns compact students through online distillation, while the outer loop optimizes weights for better knowledge distillation from diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.
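As a rough illustration of the inner-loop objective described above, the following sketch computes a weighted mutual learning loss for one student. The exact loss used in the paper is not given in this excerpt, so the form here (cross-entropy mixed with a peer-weighted KL term through a coefficient `alpha`) is an assumption, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the true label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_mutual_loss(student_logits, i, label, weights, alpha):
    """Assumed loss for student i: (1 - alpha) * CE + alpha * sum_j w_j * KL(p_j || p_i).

    student_logits: one list of logits per student, all for the same example
    weights: peer importance weights (assumed non-negative, summing to 1)
    alpha: assumed mixing coefficient between supervised and distillation terms
    """
    probs = [softmax(l) for l in student_logits]
    ce = cross_entropy(probs[i], label)
    distill = sum(
        w * kl_divergence(probs[j], probs[i])
        for j, w in enumerate(weights)
        if j != i  # a student does not distill from itself
    )
    return (1 - alpha) * ce + alpha * distill
```

With `alpha = 0` the sketch reduces to plain cross-entropy, and when all students produce identical predictions the distillation term vanishes, which gives two quick sanity checks on the implementation.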

Paper Structure

This paper contains 33 sections, 4 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the difference between vanilla knowledge distillation and our approach, diversity-induced weighted mutual learning (DWML). (a) Vanilla knowledge distillation (Hinton et al., 2015), the popular method in which the student network (RoBERTa-xM) can only learn from a trained teacher network (RoBERTa-base-125M). Here xM refers to a student model with x million parameters. (b) The DWML framework, in which the student models are initialised with parameter counts $N/2, N/3, \dots, N/(p+1)$ using Bayesian optimisation search. Rather than averaging the knowledge from students, DWML leverages bi-level optimization to estimate the relative importance of each student (e.g., weight $\omega_i$ for student $i$).
  • Figure 2: Performance comparison across experimental settings on the 10M dataset: (left) model performance as the number of peers varies; (middle) impact of the alpha parameter in the loss function on model accuracy; (right) relationship between relative importance and accuracy for different model sizes.
  • Figure 3: Peer importance weights trained dynamically with the mirror descent algorithm, as described in the mirror descent update equation.
  • Figure 4: GPU utilization for different distillation methods.
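
The captions for Figures 1 and 3 mention peer parameter counts of $N/2, N/3, \dots, N/(p+1)$ and importance weights trained with mirror descent. The sketch below illustrates both under stated assumptions: the paper's own update equation is not reproduced in this excerpt, so the step shown is the standard entropic mirror descent (exponentiated gradient) update on the probability simplex, which may differ from the authors' formulation. The function names, the learning rate `lr`, and the use of integer division for the budgets are hypothetical.

```python
import math

def peer_parameter_budgets(n_params, num_peers):
    """Parameter budgets N/2, N/3, ..., N/(p+1), rounded down (assumed rounding)."""
    return [n_params // (k + 1) for k in range(1, num_peers + 1)]

def mirror_descent_step(weights, grads, lr):
    """One entropic mirror descent step on the simplex.

    Each weight is scaled by exp(-lr * gradient) and the result is
    renormalised, so the weights stay positive and sum to 1.
    """
    unnormalised = [w * math.exp(-lr * g) for w, g in zip(weights, grads)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Hypothetical example: three peers start with uniform weights, and the
# outer-loop gradient (illustrative values) is smallest for peer 1.
weights = [1 / 3, 1 / 3, 1 / 3]
grads = [0.5, 0.1, 0.9]
updated = mirror_descent_step(weights, grads, lr=1.0)
```

In this sketch, peers whose weight has a smaller outer-loop gradient gain importance relative to the others, and the multiplicative form keeps the weights a valid distribution without an explicit projection step.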