BitSkip: An Empirical Analysis of Quantization and Early Exit Composition

Ramshankar Bhuvaneswaran, Handan Liu

TL;DR

This paper tackles the compositional effects of combining quantization with dynamic routing in large language models. It introduces BitSkip, a framework coupling BitLinear quantization with LayerSkip-based early exits, and evaluates three variants (8-bit no Hadamard, 4-bit with Hadamard, and 8-bit with Hadamard) against a full-precision baseline using a two-phase training protocol on the TinyStories dataset. The key findings are that BitSkip-V1 achieves perplexity $1.13$ (vs $1.19$ for the baseline) and substantial early-exit speedups (up to $32.5\%$ at layer $18$ with only $4\%$ quality loss), while Hadamard transforms catastrophically degrade learning even at 8-bit precision (over $3.7\times 10^4\%$ degradation). The results reveal a variance-quality paradox: activation stabilization via Hadamard does not guarantee learning, and simpler co-designed architectures can outperform more complex, theoretically advantageous combinations, providing practical guidance for designing efficient LLMs. Limitations include the dataset size and scope of techniques, suggesting future work in scaling, theory, predictive composition models, and broader efficiency methods.

Abstract

The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While the individual benefits of these methods are well documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without a Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes with the full-precision baseline in quality (perplexity of 1.13 vs 1.19). The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, which we trace to fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing an optimal 32.5% speed gain for a minimal 4% quality loss.
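To make the two ingredients named in the abstract concrete, the following is a minimal sketch of symmetric absmax quantization and a normalized Hadamard transform in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `absmax_quantize`, `dequantize`, and `hadamard` are hypothetical, and per-tensor absmax scaling is a generic scheme that may differ from the exact BitLinear formulation used in the paper.

```python
import math

def absmax_quantize(values, bits=8):
    """Symmetric per-tensor quantization (generic scheme, assumed here)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0   # avoid division by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

def hadamard(values):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b   # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

# A single outlier is spread evenly across all coordinates by the transform,
# which is the usual motivation for applying it before low-bit quantization.
activations = [8.0, 0.0, 0.0, 0.0]
rotated = hadamard(activations)
codes, scale = absmax_quantize(rotated, bits=4)
restored = hadamard(dequantize(codes, scale))   # the normalized transform is its own inverse
```

The sketch only shows the forward arithmetic. It says nothing about training dynamics, which is where the paper reports the Hadamard variants failing.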

Paper Structure

This paper contains 20 sections, 7 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Activation variance (standard deviation) across layer depth for all model variants. Lower and more stable variance indicates better-controlled activation distributions, which correlates with training stability.