Efficiently Distilling LLMs for Edge Applications
Achintya Kundu, Fabian Lim, Aaron Chew, Laura Wynter, Penny Chong, Rhui Dih Lee
TL;DR
The paper addresses edge deployment of LLMs by proposing MLFS, a parameter-efficient, multistage, low-rank fine-tuning framework for super-transformers that supports a palette of model sizes. It combines stage-wise low-rank adapters, gradient scaling across subnets, and both knowledge and feature distillation, so that many subnets are trained within a single framework while storage stays modest. Empirically, encoder models can be compressed substantially with minimal performance loss and converge faster. Decoder models are harder to compress to a comparable degree, but slicing them reduces training time, with gains reported for CodeLlama and SantaCoder. Overall, MLFS makes edge-friendly fine-tuning and deployment of heterogeneous LLMs practical by balancing accuracy, storage, and compute for enterprise needs, and it suggests extending gradient scaling and distillation to broader architectures.
Abstract
Supernet training of LLMs is of great interest in industrial applications because it can produce a palette of smaller models of different sizes and latencies at a constant cost, regardless of how many models are produced. We propose a new method called Multistage Low-rank Fine-tuning of Super-transformers (MLFS) for parameter-efficient supernet training. We show that it is possible to obtain high-quality encoder models that are suitable for commercial edge applications, and that while decoder-only models are resistant to a comparable degree of compression, decoders can be effectively sliced for a significant reduction in training time.
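To make the idea of weight-shared subnets with low-rank updates concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a subnet is obtained by taking the leading rows and columns of a frozen supernet weight matrix, and that the trainable update is a low-rank product B·A sliced in the same way. The class name `SliceableLowRankLinear`, its methods, and the initialization scale are hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class SliceableLowRankLinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update B @ A.

    Every subnet reuses the same W, A and B; a smaller subnet simply reads
    the leading rows/columns (an assumed slicing rule for this sketch).
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        # Frozen supernet weight, shape (d_out, d_in).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors: A is (rank, d_in), B is (d_out, rank).
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero so the update B @ A is initially zero.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self, k_in, k_out):
        """Return the (k_out, k_in) subnet weight: sliced W plus sliced B @ A."""
        w = [row[:k_in] for row in self.W[:k_out]]
        a = [row[:k_in] for row in self.A]
        b = self.B[:k_out]
        delta = matmul(b, a)
        return [[wi + di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]

    def forward(self, x, k_out):
        """Apply the subnet whose input width is len(x) and output width is k_out."""
        weight = self.effective_weight(len(x), k_out)
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in weight]

    def num_adapter_params(self):
        """Count trainable parameters (A and B only); W is frozen and shared."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = SliceableLowRankLinear(d_in=8, d_out=6, rank=2)
y_full = layer.forward([1.0] * 8, k_out=6)   # full-width subnet
y_small = layer.forward([1.0] * 4, k_out=3)  # smaller subnet, same parameters
```

In this sketch the only additional storage beyond the shared weight is the pair of factors A and B, which is one way to read the claim that many subnets can be kept at modest storage cost. How MLFS actually schedules its stages, scales gradients, and distills is not specified here and is not modeled.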
