Efficiently Distilling LLMs for Edge Applications

Achintya Kundu; Fabian Lim; Aaron Chew; Laura Wynter; Penny Chong; Rhui Dih Lee


TL;DR

The paper tackles edge deployment of LLMs by proposing MLFS, a parameter‑efficient, multistage, low‑rank fine‑tuning framework for supertransformers that supports a palette of model sizes. It combines stage‑wise low‑rank adapters, gradient scaling across subnets, and dual knowledge/feature distillation to train many subnets within a single framework while keeping storage modest. Empirical results show encoder models can be compressed substantially with minimal performance loss and faster convergence, whereas decoder models benefit from model slicing to reduce training time, with meaningful gains on CodeLlama and Santacoder tasks. Overall, MLFS enables practical, edge‑friendly fine‑tuning and deployment of heterogeneous LLMs, balancing accuracy, storage, and compute for enterprise needs, and suggests avenues for extending gradient scaling and distillation to broader architectures.
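
As a rough illustration of what combining knowledge and feature distillation can look like, the sketch below adds a KL term on temperature-softened logits to a mean-squared-error term on hidden features. It is a generic textbook formulation, not the loss defined in the paper: the function names, the `temperature` and `feature_weight` parameters, and the assumption that teacher and student features have the same length are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_logits, student_logits,
                      teacher_feats, student_feats,
                      temperature=2.0, feature_weight=1.0):
    """Knowledge term (KL on softened logits) plus feature term (MSE).

    Assumption: both terms are simply summed with a scalar weight;
    the paper's actual weighting and feature alignment may differ.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    knowledge_term = kl_divergence(p_teacher, p_student)
    feature_term = mse(teacher_feats, student_feats)
    return knowledge_term + feature_weight * feature_term

# Hypothetical toy values for one token position.
loss = distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5],
                         [0.1, 0.2, 0.3], [0.0, 0.25, 0.35])
```

In this sketch the loss is zero when the student reproduces the teacher's logits and features exactly, and grows as either diverges.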

Abstract

Supernet training of LLMs is of great interest in industrial applications as it confers the ability to produce a palette of smaller models at constant cost, regardless of the number of models (of different size / latency) produced. We propose a new method called Multistage Low-rank Fine-tuning of Super-transformers (MLFS) for parameter-efficient supernet training. We show that it is possible to obtain high-quality encoder models that are suitable for commercial edge applications, and that while decoder-only models are resistant to a comparable degree of compression, decoders can be effectively sliced for a significant reduction in training time.


Paper Structure (17 sections, 2 theorems, 15 equations, 9 figures, 2 tables, 2 algorithms)


Introduction
Related Work
Solution Design
Problem Formulation
Multistage Low-rank Fine-tuning of Super-transformers
Gradient Scaling
Proof:
Distillation Loss for Super-transformers:
Low-rank approach for Distilling an LLM onto a Pre-trained student
Results on Encoder and Decoder LLMs
Performance of Encoder Models
Ablation Study on Gradient Scaling
Ablation Study on Rank of $A,B$
Performance of Decoder Models
Results on CodeLlama-7B-Python
...and 2 more sections

Key Result

Proposition 1

Let the individually fine-tuned weights of a subnet, $\Phi$, be expressed as $\mathrm{W}_{\Phi} = \boldsymbol{\Pi}_{\Phi} ( \mathrm{W}^{\mathrm{pretrain}}_{\texttt{Tch}} ) + \Delta \mathrm{W}_{\Phi}$. Then, MLFS has the following structure on $\Delta \mathrm{W}_{\Phi}$: where $\{A_s, B_s \}_{s=0,1,2}$ are low-rank matrices shared across all sub-transformers $\Phi \in \mathcal{A}$.
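
To make the idea of low-rank matrices shared across sub-transformers concrete, here is a minimal sketch. The proposition's formula is not reproduced in this summary, so the code does not claim to implement it. It assumes the simplest possible case: a single pair $(A, B)$, an update $\boldsymbol{\Pi}_{\Phi}(AB)$, and a projection $\boldsymbol{\Pi}_{\Phi}$ that keeps the leading rows and columns of a matrix. The helper names `project` and `subnet_weight` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def project(w, rows, cols):
    """Assumed projection: keep the leading rows x cols block of w."""
    return [row[:cols] for row in w[:rows]]

def subnet_weight(w_pretrain, a, b, rows, cols):
    """Sliced pre-trained weight plus the same slice of the shared update A @ B.

    Assumption: one (A, B) pair and a plain slicing projection. The
    paper uses three pairs (s = 0, 1, 2) whose exact combination is
    not shown in this summary.
    """
    base = project(w_pretrain, rows, cols)
    delta = project(matmul(a, b), rows, cols)
    return [[x + y for x, y in zip(r_base, r_delta)]
            for r_base, r_delta in zip(base, delta)]

# Hypothetical 3x3 teacher weight and a rank-1 shared update.
w_teacher = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
a_shared = [[1], [2], [3]]   # shape 3 x 1
b_shared = [[1, 0, 1]]       # shape 1 x 3

maxnet = subnet_weight(w_teacher, a_shared, b_shared, 3, 3)
minnet = subnet_weight(w_teacher, a_shared, b_shared, 2, 2)
```

Under these assumptions every subnet reads its update from the same stored factors, so the number of trainable parameters does not grow with the number of subnets; only the slice taken changes.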

Figures (9)

Figure 1: Model size vs performance trade-off for task-specific BERT models produced by MLFS against other methods on 6 GLUE data sets.
Figure 2: Latency vs performance trade-off for task-specific BERT models produced by MLFS against other methods on 6 GLUE data sets.
Figure 3: Ablation study on gradient scaling: MLFS minnet convergence is improved using gradient scaling.
Figure 4: Ablation study on MLFS rank of $A,B$. Maxnet (top: blue), minnet (bottom: green), and average of two medium-sized subnets (middle: orange). Rank $r=8$ is optimal for small and medium subnets.
Figure 5: Performance of MLFS on a custom Santacoder 0.7B model using 10K/400K/1.2M training examples.
...and 4 more figures

Theorems & Definitions (2)

Proposition 1
Proposition 2
