When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

Abstract

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B-parameter distilled model that matches its teacher to within 0.2 pp under log-likelihood scoring falls behind by 20.8 pp when it must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student and can in some cases reverse the ranking of design choices, so conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86--90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2--4$\times$ at 128K-token contexts.
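
The distinction between the two evaluation protocols is easy to state in code. Below is a minimal sketch assuming a HuggingFace-style causal LM; the checkpoint name, toy question, and the option_loglik helper are illustrative only (this is not the paper's harness), and the helper glosses over tokenization edge cases that real evaluation suites handle carefully.

```python
# Minimal sketch of the two evaluation protocols contrasted above.
# Checkpoint, question, and option_loglik are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-0.6B"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
model.eval()

question = "Q: What is the capital of France?\nA:"
options = [" Paris", " London", " Berlin", " Madrid"]

def option_loglik(prompt: str, option: str) -> float:
    """Sum of log-probs the model assigns to the option's tokens.

    Assumes the prompt tokenizes to an identical prefix of prompt+option,
    which holds for simple leading-space options like the ones above.
    """
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position i predict token i+1, so shift by one position
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, n_prompt:]
    positions = range(n_prompt - 1, ids.shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, targets))

# Protocol 1: log-likelihood ranking. The model never emits a token;
# it only scores fixed candidates, so degraded generation goes unnoticed.
ranked = max(options, key=lambda o: option_loglik(question, o))

# Protocol 2: autoregressive generation. The model must produce the
# answer itself, token by token, compounding any per-step errors.
ids = tokenizer(question, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8, do_sample=False)
generated = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

print(f"ranked:    {ranked!r}")
print(f"generated: {generated!r}")
```

Under the first protocol a student only has to assign more mass to the correct option than to the wrong ones; under the second, every emitted token feeds back into the model's state, which is where gaps like the 20.8 pp reported above open up.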


Figures (5)

  • Figure 1: Architecture overview. Left: The two block types in the hybrid transformer: an Attention block (7 of 28 layers, frozen from the teacher) and a KDA block (21 of 28 layers). MLPs and layer norms are initialized from the teacher and trained only during end-to-end distillation (Stage 3). Right: Internal structure of the KDA layer. Fill colors indicate the earliest distillation stage in which each component begins training; once introduced, components continue training in all subsequent stages. Green dashed borders denote initialization from teacher weights; dark-red dotted borders denote identity initialization; the absence of a custom border denotes random initialization.
  • Figure 2: Inference efficiency comparison (single GPU, batch size 1). (a) Memory usage: SSMs use constant $\mathcal{O}(1)$ state, reducing hybrid memory footprint by up to 75%. (b) Time to First Token: Linear SSM scaling yields 2--4$\times$ speedups at long contexts (128K) despite short-context overhead. (c) Decode speed: SSMs maintain constant throughput; KDA reaches parity with Transformers at 32K tokens.
  • Figure 3: Held-out KL loss during Stage 3b for different Stage 3a token budgets.
  • Figure 4: Evaluation loss curves for attention layer selection methods during Stage 3b. Greedy Learned achieves the lowest loss but Beam Add yields better downstream performance.
  • Figure 5: (a) Theoretical cache memory vs. sequence length. Attention cache grows linearly with context; SSM state is constant. (b) Peak throughput (sweeping batch size to OOM) at four context lengths. Annotations show relative throughput vs. the Teacher. At 32K+ tokens, SSM variants achieve 1.9--2.5$\times$ the teacher's peak throughput by fitting larger batches. (A sketch of the cache arithmetic in panel (a) follows this list.)
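
The "up to 75%" cache reduction in Figure 2(a) and the linear-vs-constant curves in Figure 5(a) follow directly from the layer split in Figure 1: only 7 of 28 layers keep a KV cache, so the cache shrinks by 1 - 7/28 = 75% at any context length. Here is a back-of-the-envelope sketch; the head geometry is assumed for illustration (the captions do not specify head counts or dimensions), and the small constant-size KDA state is ignored.

```python
# Back-of-the-envelope KV-cache arithmetic behind Figures 2(a) and 5(a).
# Only the 7-of-28 layer split comes from the paper; the head geometry
# below is hypothetical, and the O(1) KDA state is neglected.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # One K and one V vector per token, per KV head, per attention layer
    # (hence the factor of 2), stored as 2-byte fp16/bf16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

N_KV_HEADS, HEAD_DIM = 8, 128  # assumed geometry, illustration only

for seq_len in (4_096, 32_768, 131_072):
    full = kv_cache_bytes(28, N_KV_HEADS, HEAD_DIM, seq_len)   # pure Transformer
    hybrid = kv_cache_bytes(7, N_KV_HEADS, HEAD_DIM, seq_len)  # 7 attention layers
    print(f"{seq_len:>7} tokens: {full / 2**30:5.2f} GiB -> "
          f"{hybrid / 2**30:5.2f} GiB  ({1 - hybrid / full:.0%} smaller)")
```

The absolute numbers depend on the assumed geometry, but the 75% ratio does not: it is fixed by the 7/28 layer split and holds at every sequence length, which is why the attention and hybrid curves in Figure 5(a) stay a constant factor apart.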