A Systematic Study of Compression Ordering for Large Language Models

Shivansh Chhawri, Rahul Mahadik, Suparna Rooj

TL;DR

The paper addresses how to sequence three compression techniques, Knowledge Distillation (KD), Structured Pruning (P), and Quantization (Q), to deploy large language models efficiently. It systematically evaluates each technique on its own and all six three-technique orderings on Qwen2.5-based models, using perplexity, G-Eval, prompt alignment, and clarity as evaluation criteria. The key finding is that quantization provides the strongest standalone compression, but final model quality depends heavily on ordering: Pruning → Knowledge Distillation → Quantization (P-KD-Q) achieves the best balance, with a compression ratio of about 3.68x while preserving instruction-following capabilities. The work offers a practical, ordering-aware pipeline for deploying compressed LLMs in resource-constrained environments and motivates future exploration of adaptive pruning and mixed-precision quantization across larger Qwen variants and multimodal models.

Abstract

Large Language Models (LLMs) require substantial computational resources, making model compression essential for efficient deployment in constrained environments. The individual effects of the dominant compression techniques (knowledge distillation, structured pruning, and low-bit quantization) are well studied, but their interactions and optimal sequencing remain unclear. This work systematically examines how these techniques perform both independently and in combination when applied to the Qwen2.5 3B model. We evaluate multiple compression pipelines, including single-technique baselines and the proposed three-technique sequences, using perplexity, G-Eval, clarity, prompt alignment, and compression ratio as metrics. Our experiments show that quantization provides the greatest standalone compression, while pruning introduces moderate quality degradation. Critically, the ordering of techniques significantly affects final model quality: the sequence Pruning, Knowledge Distillation, Quantization (P-KD-Q) yields the best balance, achieving a 3.68x compression ratio while preserving strong instruction-following and language understanding capabilities. Conversely, pipelines that apply quantization early suffer severe performance degradation due to irreversible information loss that impairs subsequent training. Overall, this study offers practical insight into designing effective, ordering-aware compression pipelines for deploying LLMs in resource-limited settings.
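
To make the search space concrete, the following is a minimal sketch of how the six three-technique orderings could be enumerated and how a compression ratio might be computed. It assumes the ratio is defined as original model size divided by compressed model size, which this excerpt does not state explicitly. The function names and the byte counts in the usage lines are hypothetical and are not taken from the paper.

```python
from itertools import permutations

# Abbreviations used in the abstract: P = structured pruning,
# KD = knowledge distillation, Q = quantization.
TECHNIQUES = ("P", "KD", "Q")


def all_orderings(techniques=TECHNIQUES):
    """Return every ordering of the techniques as a label such as 'P-KD-Q'."""
    return ["-".join(order) for order in permutations(techniques)]


def compression_ratio(original_bytes, compressed_bytes):
    """Assumed definition: original size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return original_bytes / compressed_bytes


orderings = all_orderings()
print(len(orderings), orderings)

# Hypothetical sizes chosen only to illustrate the arithmetic.
print(round(compression_ratio(368, 100), 2))
```

Three techniques give 3! = 6 orderings, which matches the six pipelines the study compares.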

Paper Structure

This paper contains 16 sections, 2 equations, 1 table.