Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli

TL;DR

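A layer- and token-wise analysis links training objectives to representational structure: diffusion objectives yield substantial early-layer redundancy, while AR objectives yield tightly coupled, depth-dependent representations. A static, task-agnostic inference-time layer-skipping method that requires no architectural changes or KV-cache sharing exploits this redundancy for practical, cache-orthogonal efficiency gains.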

Abstract

Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.

Paper Structure (18 sections, 14 figures, 3 tables, 1 algorithm)

Figures (14)

  • Figure 1: Representational redundancy enables efficient inference-time layer skipping. We evaluate zero-shot layer pruning across three models: LLaDA (diffusion LLM), Dream-7B (dLLM initialized from Qwen2.5), and Qwen2.5-7B (AR-LLM). The diffusion-based LLaDA is notably robust, retaining 88.24% performance at 18.75% FLOPs reduction (6 layers skipped), which indicates significant representational redundancy. Conversely, autoregressive models are brittle, retaining only 64.71% performance at 7.14% FLOPs reduction (2 layers skipped), which indicates concentrated, non-redundant representations. The top-right region indicates the best performance-efficiency trade-off.
  • Figure 2: Layer-skip mechanism for dLLMs. At each denoising step, high-similarity layers (shaded) are bypassed, with hidden states passed directly to the next active layer. This reduces per-step FLOPs while preserving the coarse-to-fine abstraction hierarchy.
  • Figure 3: Average token-wise cosine similarity across layers and denoising steps. LLaDA (native dLLM) exhibits high similarity ($>0.9$) in early layers with smooth transitions, followed by lower similarity in later layers where refinement occurs. Dream-7B closely follows Qwen2.5's pattern despite diffusion training, revealing persistent initialization bias. Shaded regions show standard deviation across denoising steps for LLaDA and Dream-7B.
  • Figure 4: Layer-wise cosine similarity across models, with 32 tokens decoded. Each row shows similarity between consecutive layers for (top) LLaDA, (middle) Qwen2.5, and (bottom) Dream-7B. High-similarity regions (yellow) indicate representational redundancy. Dream-7B's pattern closely resembles Qwen2.5's despite diffusion training, revealing strong initialization bias.
  • Figure 5: Layer-wise cosine similarity across models, with all tokens decoded. Each row shows similarity between consecutive layers for (top) LLaDA, (middle) Qwen2.5, and (bottom) Dream-7B. High-similarity regions (yellow) indicate representational redundancy. Dream-7B's pattern closely resembles Qwen2.5's despite diffusion training, revealing strong initialization bias.
  • ...and 9 more figures
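
The captions for Figures 1 to 3 describe three ingredients: token-wise cosine similarity between consecutive layers, a static set of high-similarity layers that is bypassed at every denoising step, and a FLOPs reduction that tracks the number of skipped layers. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. All function names are hypothetical, the 0.9 threshold is borrowed from the Figure 3 caption, the top-k selection rule is an assumption, and the FLOPs estimate assumes that cost scales with layer count (ignoring embeddings and the output head) with 32 and 28 layers for the two models, which is consistent with the reported 18.75% and 7.14%.

```python
import numpy as np


def consecutive_layer_similarity(hidden_states):
    """Mean token-wise cosine similarity between each layer's input and output.

    hidden_states: array of shape (num_layers + 1, num_tokens, dim), where
    index 0 is the embedding output (an assumed layout for this sketch).
    Returns an array of shape (num_layers,).
    """
    h_in, h_out = hidden_states[:-1], hidden_states[1:]
    dot = np.sum(h_in * h_out, axis=-1)
    norms = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return (dot / np.maximum(norms, 1e-12)).mean(axis=-1)


def select_skip_layers(similarity, num_skip, threshold=0.9):
    """Statically pick up to num_skip layers with the highest similarity.

    Assumed rule: keep only candidates above the threshold, chosen once
    offline and reused for every input and every denoising step.
    """
    order = np.argsort(-similarity, kind="stable")
    return sorted(int(i) for i in order[:num_skip] if similarity[i] > threshold)


def forward_with_skips(x, layers, skip):
    """Run the layer stack, passing hidden states straight past skipped layers."""
    skip = set(skip)
    for idx, layer in enumerate(layers):
        if idx in skip:
            continue  # bypass: the hidden state flows to the next active layer
        x = layer(x)
    return x


def flops_reduction(num_skipped, num_layers):
    """Fraction of per-step FLOPs saved, assuming identical per-layer cost."""
    return num_skipped / num_layers


# Toy check: layers 0 and 2 leave token directions unchanged, layer 1 flips them.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(4, 8))
states = np.stack([h0, h0, -h0, -2.0 * h0])
sim = consecutive_layer_similarity(states)
skip = select_skip_layers(sim, num_skip=2)

toy_layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v + 10]
result = forward_with_skips(3, toy_layers, skip)
```

Because the skip set is fixed ahead of time, this kind of scheme needs no per-input routing and no changes to the model itself, which matches the "static, task-agnostic" description in the abstract.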