Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression
Jacob Sander, David Moe, Achraf Cohen, Brent Venable, Venkat Dasari, Brian Jalaian
TL;DR
This work addresses the challenge of deploying large language models on resource-constrained edge devices by isolating the effect of the re-training loss after a fixed, layer-wise $L_2$-norm pruning of the Transformer MLP blocks. It compares cross-entropy (CE) fine-tuning, which requires labels, with KL-divergence self-distillation, which uses only teacher logits, on the OLMo2-7B-SFT model for CommonsenseQA. The key finding is that KL-based self-distillation often matches or exceeds CE fine-tuning under identical pruning, yielding a 3–5% improvement in test accuracy at around 50% parameter retention and showing more favorable uncertainty behavior as data scales. These results highlight the role of loss-function design in recovering compressed models for edge deployment, suggest that self-distillation can offer practical benefits in data-sparse, latency-constrained environments, and point future work toward broader pruning and quantization strategies for full Transformer architectures.
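As a rough illustration of what layer-wise $L_2$-norm pruning of an MLP block can look like, the sketch below scores each hidden neuron by the $L_2$ norm of its row in the up-projection and keeps the highest-scoring fraction. The exact scoring rule and pruning schedule are not specified here, so the function name `l2_prune_mlp`, the choice of the up-projection rows as the scored quantity, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
import math


def l2_prune_mlp(w_up, w_down, keep_ratio):
    """Prune hidden neurons of one MLP block by L2 norm (illustrative sketch).

    w_up:   list of `hidden` rows, each of length d_model (up-projection).
    w_down: list of d_model rows, each of length `hidden` (down-projection).
    keep_ratio: fraction of hidden neurons to retain (hypothetical parameter).
    """
    hidden = len(w_up)
    n_keep = max(1, int(round(hidden * keep_ratio)))

    # Assumption: a neuron's importance is the L2 norm of its up-projection row.
    scores = [math.sqrt(sum(x * x for x in row)) for row in w_up]

    # Indices of the n_keep highest-scoring neurons, restored to original order.
    ranked = sorted(range(hidden), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])

    # Drop pruned rows from the up-projection and the matching columns
    # from the down-projection so the two matrices stay shape-compatible.
    pruned_up = [w_up[i] for i in keep]
    pruned_down = [[row[i] for i in keep] for row in w_down]
    return pruned_up, pruned_down, keep
```

Applied independently to every layer with the same `keep_ratio`, this gives a fixed pruning baseline, so any difference in recovered accuracy can be attributed to the re-training loss rather than to the pruning step.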
Abstract
Modern foundation models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the MLP blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) fine-tuning with cross-entropy (L2PFT), which requires labeled data, versus (ii) self-distillation with KL-divergence (L2PSD), which uses only teacher logits and no labels. We evaluate both pipelines on the OLMo2-7B-SFT model for CommonsenseQA, in a setting suited to the intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, KL-based distillation matches or exceeds CE fine-tuning in test accuracy, demonstrating that, even with basic MLP-only pruning, the choice of loss function materially affects compressed model recovery in resource-constrained environments.
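The two re-training objectives being compared can be sketched for a single prediction as follows. This is a minimal illustration, not the training code used in this work: the function names, the `temperature` parameter, and the KL direction (teacher distribution as the reference, which is the common convention for distillation) are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(student_logits, label):
    """CE fine-tuning loss: needs the ground-truth label index."""
    return -log_softmax(student_logits)[label]


def kl_distillation(teacher_logits, student_logits, temperature=1.0):
    """Self-distillation loss KL(teacher || student): needs only teacher logits.

    `temperature` is a hypothetical knob; 1.0 leaves the logits unchanged.
    """
    t = log_softmax([z / temperature for z in teacher_logits])
    s = log_softmax([z / temperature for z in student_logits])
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))
```

In this sketch the teacher would be the unpruned model and the student its pruned copy. The CE loss only rewards probability mass on the labeled answer, while the KL loss asks the student to reproduce the teacher's full output distribution, which is why it can be computed without labels.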
