MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings
Andrea Gurioli, Federico Pennino, João Monteiro, Maurizio Gabbrielli
TL;DR
This work addresses the deployment efficiency challenge of large language models for code understanding by balancing accuracy and latency. It introduces MoSE, a 1B-parameter, 36-layer modular multi-exit encoder built on StarCoder-2 that uses Self-Distillation to improve early-layer representations through layer-wise supervision with MLM and an In-Context Classification loss. MoSE supports exit points at layers 4, 9, 18, 27, and 36, enabling flexible inference while maintaining strong retrieval and clone-detection performance. A new SynthCoNL dataset augments text-to-code and code-to-code benchmarks with cross-language translations, and MoSE achieves state-of-the-art results among open models on CodeSearchNet and related benchmarks, demonstrating practical deployment benefits with substantial reductions in computation at early exits.
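To make the exit points concrete, the following is a minimal sketch of early-exit inference over a 36-layer stack. It is not the released MoSE code: `make_toy_layer`, `encode_with_exit`, and `depth_fraction` are hypothetical names, and the toy layers stand in for real transformer blocks. The depth fraction is only a rough proxy for compute, since it ignores embedding and pooling costs.

```python
from typing import Callable, List, Sequence

EXIT_LAYERS = (4, 9, 18, 27, 36)  # exit points listed in the TL;DR
NUM_LAYERS = 36                   # total encoder depth listed in the TL;DR

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(index: int) -> Layer:
    """Hypothetical stand-in for a transformer block: a cheap elementwise update."""
    def layer(x: Vector) -> Vector:
        return [0.9 * v + 0.01 * index for v in x]
    return layer


def encode_with_exit(layers: Sequence[Layer], x: Vector, exit_layer: int) -> Vector:
    """Run only the first `exit_layer` layers and return the hidden state there."""
    if exit_layer not in EXIT_LAYERS:
        raise ValueError(f"exit_layer must be one of {EXIT_LAYERS}")
    for layer in layers[:exit_layer]:
        x = layer(x)
    return x


def depth_fraction(exit_layer: int, num_layers: int = NUM_LAYERS) -> float:
    """Fraction of the layer stack that is executed at a given exit."""
    return exit_layer / num_layers


layers = [make_toy_layer(i) for i in range(1, NUM_LAYERS + 1)]
for e in EXIT_LAYERS:
    emb = encode_with_exit(layers, [1.0, -1.0], e)
    print(f"exit {e:>2}: {depth_fraction(e):.1%} of layers, embedding={emb}")
```

Under this proxy, exiting at layer 9 runs a quarter of the stack and exiting at layer 18 runs half of it; the actual latency savings depend on hardware and sequence length.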
Abstract
Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This approach significantly enhances lower-layer representations, enabling flexible deployment of different model portions with favorable performance trade-offs. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads, where higher layers guide earlier ones during training, improving intermediate representations at minimal additional cost. We further enhance MoSE with a repository-level contextual loss that maximizes training context window utilization. Additionally, we release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs. Evaluations demonstrate the effectiveness of Self-Distillation as a principled approach to trading inference cost for accuracy across various code understanding tasks.
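One common way to supervise several exit heads at once is to apply the same objective to each exit layer's embeddings and sum the results. The sketch below illustrates that pattern with an in-batch contrastive loss. It is an assumption made for illustration, not the authors' exact formulation: the choice of contrastive objective, the temperature, the uniform weights, and the names `in_batch_contrastive_loss` and `multi_exit_loss` are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

EXIT_LAYERS = (4, 9, 18, 27, 36)  # exit points named in the TL;DR

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(
    queries: Sequence[Vector], codes: Sequence[Vector], temperature: float = 0.1
) -> float:
    """Mean cross-entropy where query i should match code i (in-batch negatives)."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, c) / temperature for c in codes]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def multi_exit_loss(
    query_by_exit: Dict[int, Sequence[Vector]],
    code_by_exit: Dict[int, Sequence[Vector]],
    weights: Optional[Dict[int, float]] = None,
) -> float:
    """Weighted sum of the per-exit losses (uniform weights are an assumption)."""
    if weights is None:
        weights = {e: 1.0 for e in EXIT_LAYERS}
    return sum(
        weights[e] * in_batch_contrastive_loss(query_by_exit[e], code_by_exit[e])
        for e in EXIT_LAYERS
    )


# Toy batch: two text/code pairs whose embeddings are already aligned at every exit.
aligned = {e: [[1.0, 0.0], [0.0, 1.0]] for e in EXIT_LAYERS}
print(multi_exit_loss(aligned, aligned))
```

Because every exit contributes to one summed loss, a single backward pass updates the shared lower layers using signals from all exits, which is why this style of training adds little cost compared with training a separate small model.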
