QTALE: Quantization-Robust Token-Adaptive Layer Execution for LLMs

Kanghyun Noh, Jinheon Choi, Yulwha Kim

TL;DR

QTALE targets efficient LLM deployment by combining token-adaptive layer execution, which reduces FLOPs, with quantization, which reduces memory footprint. The framework (i) uses quantization-robust training so that diverse execution paths keep being explored during fine-tuning and (ii) adjusts the execution ratio at inference time to reintroduce redundancy when needed. Across multiple LLaMA variants, QTALE stays within 0.5% accuracy of quantization-only baselines on CommonsenseQA while keeping the memory savings of 4-bit quantization and comparable FLOP reductions. The approach yields practical speedups and storage reductions for deployment on constrained hardware, and it remains robust across post-training quantization (PTQ) methods and auxiliary compression techniques.
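To make the two savings concrete, here is a back-of-the-envelope sketch. It assumes weight memory scales linearly with bits per weight (ignoring quantization metadata such as scales and zero points) and that every transformer layer costs the same number of FLOPs. The 8-billion parameter count and the per-layer execution ratios are hypothetical round numbers chosen for illustration, not results from the paper.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1e9 bytes).

    Ignores quantization metadata such as scales and zero points.
    """
    return num_params * bits_per_weight / 8 / 1e9


def relative_layer_flops(execution_ratios: list[float]) -> float:
    """Fraction of full-model layer FLOPs, assuming equal cost per layer."""
    return sum(execution_ratios) / len(execution_ratios)


# Hypothetical round numbers for illustration, not figures from the paper.
NUM_PARAMS = 8e9
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

# Hypothetical per-layer execution ratios for a 4-layer toy model.
flops_fraction = relative_layer_flops([1.0, 0.5, 0.5, 1.0])

print(f"FP16 weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
print(f"Layer FLOPs relative to full execution: {flops_fraction:.2f}")
```

The two effects are independent in this simple model: quantization shrinks storage regardless of which layers run, and layer skipping cuts compute regardless of weight precision, which is why the two techniques are described as complementary.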

Abstract

Large language models (LLMs) demand substantial computational and memory resources, posing challenges for efficient deployment. Two complementary approaches have emerged to address these issues: token-adaptive layer execution, which reduces floating-point operations (FLOPs) by selectively bypassing layers, and quantization, which lowers memory footprint by reducing weight precision. However, naively integrating these techniques leads to additional accuracy degradation due to reduced redundancy in token-adaptive models. We propose QTALE (Quantization-Robust Token-Adaptive Layer Execution for LLMs), a novel framework that enables seamless integration of token-adaptive execution with quantization while preserving accuracy. Conventional token-adaptive methods reduce redundancy in two ways: (1) by limiting the diversity of training paths explored during fine-tuning, and (2) by lowering the number of parameters actively involved in inference. To overcome these limitations, QTALE introduces two key components: (1) a training strategy that ensures diverse execution paths are actively explored during fine-tuning, and (2) a post-training mechanism that allows flexible adjustment of the execution ratio at inference to reintroduce redundancy when needed. Experimental results show that QTALE enables seamless integration of token-adaptive layer execution with quantization, showing no noticeable accuracy difference, with the gap to quantization-only models kept below 0.5% on CommonsenseQA benchmarks. By combining token-adaptive execution for FLOPs reduction and quantization for memory savings, QTALE provides an effective solution for efficient LLM deployment.
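The interplay between per-token layer skipping and an adjustable execution ratio can be sketched with a toy model. This is a minimal illustration, not the paper's implementation: it assumes each layer has a router that emits a single execute-versus-skip score per token, and that the execution ratio is raised at inference by lowering a decision threshold. The names `run_token`, `threshold`, and the toy layers and routers are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]
Router = Callable[[Vector], float]


def run_token(
    hidden: Vector,
    layers: Sequence[Layer],
    routers: Sequence[Router],
    threshold: float = 0.0,
) -> Tuple[Vector, float]:
    """Run one token through the stack, skipping layers the router rejects.

    Returns the final hidden state and the fraction of layers executed.
    A lower threshold (an assumed knob) executes more layers.
    """
    executed = 0
    for layer, router in zip(layers, routers):
        if router(hidden) > threshold:
            # Executed layer: add its output on the residual path.
            hidden = [h + d for h, d in zip(hidden, layer(hidden))]
            executed += 1
        # Skipped layer: hidden passes through unchanged.
    return hidden, executed / len(layers)


def make_router(score: float) -> Router:
    """Toy router that returns a fixed score regardless of the token."""
    return lambda hidden: score


# Toy stack: every layer scales its input by 0.1.
toy_layers = [lambda h: [0.1 * x for x in h]] * 4
toy_routers = [make_router(s) for s in (1.0, -0.5, 0.2, -2.0)]

_, default_ratio = run_token([1.0, 2.0], toy_layers, toy_routers)
_, raised_ratio = run_token([1.0, 2.0], toy_layers, toy_routers, threshold=-1.0)
print(default_ratio, raised_ratio)
```

In this sketch, lowering the threshold brings the borderline layer (score -0.5) back into the execution path, which mirrors the idea of reintroducing redundancy after training without touching the weights.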

Paper Structure

This paper contains 27 sections, 9 equations, 10 figures, 16 tables.

Figures (10)

  • Figure 1: Overview of a standard LLM architecture and representative techniques for efficient inference. The fraction of color fill in each transformer layer denotes memory cost per layer, while dashed gray outlines indicate skipped execution.
  • Figure 2: Heatmap of the average execution ratio for each layer of LLaMA3.1-8B with D-LLM. The ratios are measured on the first 200 training samples after fine-tuning epochs 0, 3, and 6, across four CommonsenseQA datasets: ARCe, ARCc, SIQA, and PIQA.
  • Figure 3: Execution behavior of D-LLM. (a) Execution ratio, (b) execution decision flipping induced by Gumbel noise across fine-tuning epochs, and (c) histogram of samples from $\pi \sim \mathrm{Gumbel}(0,1)$. Results are shown for layers 20, 23, and 26 of LLaMA3.1-8B on ARCe.
  • Figure 4: Histograms of router output logits for three low-execution layers (20, 23, and 26) of LLaMA3.1-8B on ARCe. Logits are computed from the first 200 training samples after fine-tuning epochs 0, 3, and 6.
  • Figure 5: Comparison of Gumbel-noise–induced execution decision flipping across fine-tuning epochs between D-LLM and the proposed quantization-robust training. Results are shown for three low-execution layers (20, 23, and 26) of LLaMA3.1-8B on the ARCe dataset.
  • ...and 5 more figures
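The "execution decision flipping induced by Gumbel noise" referenced in Figures 3 and 5 can be illustrated with a small simulation. This sketch assumes a two-way router (execute versus skip) whose training-time decision is the argmax of each logit plus independent Gumbel(0, 1) noise; the exact formulation used by D-LLM or QTALE is not reproduced here, and `flip_rate` and `margin` are hypothetical names. Because the difference of two independent Gumbel(0, 1) samples follows a standard logistic distribution, the flip probability for a logit margin m is 1 / (1 + exp(|m|)), which the simulation can be checked against.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw one sample from Gumbel(0, 1) via inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # clamp to avoid log(0)
    return -math.log(-math.log(u))


def flip_rate(margin: float, num_samples: int = 20000, seed: int = 0) -> float:
    """Fraction of noisy decisions that differ from the noise-free decision.

    margin = execute_logit - skip_logit for a hypothetical two-way router.
    """
    rng = random.Random(seed)
    noise_free_execute = margin > 0
    flips = 0
    for _ in range(num_samples):
        noisy_execute = margin + sample_gumbel(rng) > sample_gumbel(rng)
        flips += noisy_execute != noise_free_execute
    return flips / num_samples


def expected_flip_rate(margin: float) -> float:
    """Closed-form flip probability from the logistic distribution."""
    return 1.0 / (1.0 + math.exp(abs(margin)))


for m in (0.0, 1.0, 4.0):
    print(f"margin={m}: simulated={flip_rate(m):.3f}, "
          f"expected={expected_flip_rate(m):.3f}")
```

Under these assumptions, a router whose logits drift far apart almost never has its decision flipped by the noise, so the alternative path is rarely explored during fine-tuning. That is one way to read the link between the logit histograms in Figure 4 and the flipping behavior in Figures 3 and 5.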