Efficiently Distilling LLMs for Edge Applications

Achintya Kundu, Fabian Lim, Aaron Chew, Laura Wynter, Penny Chong, Rhui Dih Lee

TL;DR

The paper tackles edge deployment of LLMs by proposing MLFS, a parameter-efficient, multistage, low-rank fine-tuning framework for super-transformers that supports a palette of model sizes. It combines stage-wise low-rank adapters, gradient scaling across subnets, and dual knowledge/feature distillation to train many subnets within a single framework while keeping storage modest. Empirical results show encoder models can be compressed substantially with minimal performance loss and faster convergence, whereas decoder models benefit from model slicing to reduce training time, with meaningful gains on CodeLlama and Santacoder tasks. Overall, MLFS enables practical, edge-friendly fine-tuning and deployment of heterogeneous LLMs, balancing accuracy, storage, and compute for enterprise needs, and suggests avenues for extending gradient scaling and distillation to broader architectures.

Abstract

Supernet training of LLMs is of great interest in industrial applications as it confers the ability to produce a palette of smaller models at constant cost, regardless of the number of models (of different size / latency) produced. We propose a new method called Multistage Low-rank Fine-tuning of Super-transformers (MLFS) for parameter-efficient supernet training. We show that it is possible to obtain high-quality encoder models that are suitable for commercial edge applications, and that while decoder-only models are resistant to a comparable degree of compression, decoders can be effectively sliced for a significant reduction in training time.

Paper Structure

This paper contains 17 sections, 2 theorems, 15 equations, 9 figures, 2 tables, and 2 algorithms.

Key Result

Proposition 1

Let the individually fine-tuned weights of a subnet, $\Phi$, be expressed as $\mathrm{W}_{\Phi} = \boldsymbol{\Pi}_{\Phi} ( \mathrm{W}^{\mathrm{pretrain}}_{\texttt{Tch}} ) + \Delta \mathrm{W}_{\Phi}$. Then, MLFS has the following structure on $\Delta \mathrm{W}_{\Phi}$: where $\{A_s, B_s \}_{s=0,1,2}$ are low-rank matrices shared across all sub-transformers $\Phi \in \mathcal{A}$.
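
The decomposition in Proposition 1 can be illustrated with a small NumPy sketch. The exact formula for $\Delta \mathrm{W}_{\Phi}$ is not reproduced here, so the code makes two loudly labeled assumptions: the projection $\boldsymbol{\Pi}_{\Phi}$ is modeled as keeping the leading block of a matrix, and $\Delta \mathrm{W}_{\Phi}$ is modeled as the projected sum of the shared products $A_s B_s$ for $s = 0, 1, 2$. The function names `project`, `delta_weight`, and `subnet_weight` are hypothetical and not taken from the paper.

```python
import numpy as np


def project(W, rows, cols):
    """Hypothetical stand-in for Pi_Phi: keep the leading rows x cols block."""
    return W[:rows, :cols]


def delta_weight(adapters):
    """ASSUMED form of the shared update: the sum of low-rank products A_s @ B_s."""
    return sum(A @ B for A, B in adapters)


def subnet_weight(W_teacher, adapters, rows, cols):
    """Pi_Phi(W_teacher) + Delta W_Phi under the assumptions stated above."""
    delta = delta_weight(adapters)
    return project(W_teacher, rows, cols) + project(delta, rows, cols)


rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2

# Stand-in for the pretrained teacher weight of one layer.
W_teacher = rng.standard_normal((d_out, d_in))

# Three (A_s, B_s) pairs, one per stage s = 0, 1, 2, shared by every subnet.
adapters = [
    (rng.standard_normal((d_out, rank)), rng.standard_normal((rank, d_in)))
    for _ in range(3)
]

W_max = subnet_weight(W_teacher, adapters, 8, 8)  # largest subnet (maxnet)
W_min = subnet_weight(W_teacher, adapters, 4, 4)  # smallest subnet (minnet)
```

Because every subnet reads from the same `adapters` list, only the low-rank pairs need to be stored in addition to the teacher weights, which is the storage argument the proposition supports. Under the slicing assumption, the smaller subnet's weight is simply the leading block of the larger one.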

Figures (9)

  • Figure 1: Model size vs performance trade-off for task-specific BERT models produced by MLFS against other methods on 6 GLUE data sets.
  • Figure 2: Latency vs performance trade-off for task-specific BERT models produced by MLFS against other methods on 6 GLUE data sets.
  • Figure 3: Ablation study on gradient scaling: MLFS minnet convergence is improved using gradient scaling.
  • Figure 4: Ablation study on MLFS rank of $A,B$. Maxnet (top: blue), minnet (bottom: green), and average of two medium-sized subnets (middle: orange). Rank $r=8$ is optimal for small and medium subnets.
  • Figure 5: Performance of MLFS on a custom Santacoder 0.7B model using 10K/400K/1.2M training examples.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2