MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings

Andrea Gurioli, Federico Pennino, João Monteiro, Maurizio Gabbrielli

TL;DR

This work addresses the deployment efficiency challenge of large language models for code understanding by balancing accuracy and latency. It introduces MoSE, a 1B-parameter, 36-layer modular multi-exit encoder built on StarCoder-2 that uses Self-Distillation to improve early-layer representations through layer-wise supervision with MLM and an In-Context Classification loss. MoSE supports exit points at layers 4, 9, 18, 27, and 36, enabling flexible inference while maintaining strong retrieval and clone-detection performance. A new SynthCoNL dataset augments text-to-code and code-to-code benchmarks with cross-language translations, and MoSE achieves state-of-the-art results among open models on CodeSearchNet and related benchmarks, demonstrating practical deployment benefits with substantial reductions in computation at early exits.

Abstract

Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This approach significantly enhances lower-layer representations, enabling flexible deployment of different model portions with favorable performance trade-offs. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads, where higher layers guide earlier ones during training, improving intermediate representations at minimal additional cost. We further enhance MoSE with a repository-level contextual loss that maximizes training context window utilization. Additionally, we release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs. Evaluations demonstrate the effectiveness of Self-Distillation as a principled approach to trading inference cost for accuracy across various code understanding tasks.
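
As a rough illustration of how a multi-exit objective of this kind can be assembled, the sketch below combines one loss per exit layer into a single weighted sum, $\mathcal{L} = \sum_i \alpha_i \mathcal{L}_i$. The exit layers (4, 9, 18, 27, 36) come from the text; the function name `combine_exit_losses`, the depth-proportional weights, and the example loss values are hypothetical and are not taken from the paper.

```python
from typing import Dict

# Exit layers named in the text.
EXIT_LAYERS = (4, 9, 18, 27, 36)


def combine_exit_losses(layer_losses: Dict[int, float],
                        alphas: Dict[int, float]) -> float:
    """Return sum_i alpha_i * L_i over the exit layers.

    Hypothetical helper: both dictionaries are keyed by exit-layer index.
    """
    missing = [layer for layer in EXIT_LAYERS
               if layer not in layer_losses or layer not in alphas]
    if missing:
        raise KeyError(f"missing loss or weight for exit layers: {missing}")
    return sum(alphas[layer] * layer_losses[layer] for layer in EXIT_LAYERS)


# Assumption: weights proportional to depth. The paper's actual alpha_i
# values are not given in this section.
alphas = {layer: layer / 36 for layer in EXIT_LAYERS}

# Made-up per-exit loss values, for illustration only.
losses = {4: 2.0, 9: 1.5, 18: 1.2, 27: 1.0, 36: 0.9}

total = combine_exit_losses(losses, alphas)
print(f"combined objective: {total:.4f}")
```

In a real training loop each per-exit loss would be a tensor computed from that layer's exit head, so a single backward pass through the sum updates all exits together.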

Paper Structure

This paper contains 20 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: (a) Overview of our multi-exit self-distillation encoder, shown here with exit heads at selected layers (e.g., Layers 4, 9, 18, 27, and 36). Each exit head predicts an output embedding and contributes a layer loss, weighted by a coefficient $\alpha_i$, to the overall objective $\mathcal{L}$. (b) Trade-off between computational cost (GFLOPs) and performance at different exit layers on CodeSearchNet (text-to-code, average MRR) and POJ104 (code-to-code, mAP). Despite a reduction of approximately 90% in floating-point operations from layer 36 to layer 4, MRR drops by only 6.4% in absolute terms. For POJ104, our best results are observed in the initial layers.
  • Figure 2: The left side depicts the in-context loss framework, where samples from various repositories are concatenated. Positive examples originate from the same repository context, whereas negative examples are sourced from different repositories. To let the model use FlashAttention V2, we applied left padding and positioned the CLS token at the end of the sequence. The right side shows the pseudocode for the in-context loss framework.
  • Figure 3: Prompt provided to Qwen2.5-Coder-7B-Instruct for translating a given code snippet (print("Hello World") in the example) from a source programming language (Python) to a target one (Rust).
  • Figure 4: Performance of different models on CT and POJ104 for code-to-code retrieval with the CodeXGLUE dataset. MoSE achieves state-of-the-art results on the CT benchmark and outperforms all open-source models on the POJ104 benchmark. Compared to closed-source solutions (OpenAI text-embedding-3-large), MoSE performs on par in the code translation task and strongly reduces the gap between open- and closed-source models on the POJ104 benchmark.
  • Figure 5: Similarity-score heatmap presenting results from permutation tests (10,000 permutations with $\alpha = 0.05$) across different exit points for Python-to-Java retrieval. * indicates p < 0.05 and ** indicates p < 0.001. Different layers, despite a common training objective, yield different similarity scores.
  • ...and 7 more figures
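
The roughly 90% compute reduction quoted in the Figure 1 caption can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes that cost scales linearly with the number of encoder layers executed and ignores embeddings and exit heads. That simplification is ours, not the paper's GFLOPs measurement, and `relative_compute` is a hypothetical helper.

```python
TOTAL_LAYERS = 36
EXIT_LAYERS = (4, 9, 18, 27, 36)


def relative_compute(exit_layer: int, total_layers: int = TOTAL_LAYERS) -> float:
    """Fraction of full-depth layer computation used when exiting early.

    Assumes every layer costs the same; embeddings and heads are ignored.
    """
    if not 1 <= exit_layer <= total_layers:
        raise ValueError("exit_layer must be between 1 and total_layers")
    return exit_layer / total_layers


for layer in EXIT_LAYERS:
    saved = 1.0 - relative_compute(layer)
    print(f"exit at layer {layer:2d}: ~{saved:.0%} fewer layer computations")
```

Under this assumption, exiting at layer 4 runs 4 of 36 layers, a saving of about 89%, which is in line with the approximately 90% reduction reported in the caption.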