Compressing Language Models for Specialized Domains

Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras

TL;DR

This work addresses the loss of domain-specific performance that occurs when large language models are compressed. It introduces cross-calibration, a training-free, Hessian-based method that draws on both domain-specific and general calibration data. The Hessian is decomposed into domain-specific and general components, which are combined via a regularization parameter, so that the method identifies weights influential for both in-domain and general performance without retraining. Results in the biomedical and legal domains show that cross-calibration outperforms existing domain-aware pruning methods while preserving general capabilities, remains effective when combined with quantization, and incurs comparable or lower computational overhead. The approach is reported to be language-agnostic and to scale across model families and sizes, enabling practical deployment of domain-specialized compressed LMs.

Abstract

Compression techniques such as pruning and quantization offer a solution for more efficient deployment of language models (LMs), albeit with small drops in benchmark performance. However, general-purpose LM compression methods can negatively affect performance in specialized domains (e.g. biomedical or legal). Recent work has sought to address this, yet requires computationally expensive full-parameter fine-tuning. To this end, we propose cross-calibration, a novel training-free approach for improving the domain performance of compressed LMs. Our approach leverages Hessian-based sensitivity to identify weights that are influential for both in-domain and general performance. Through extensive experimentation, we demonstrate that cross-calibration substantially outperforms existing approaches on domain-specific tasks, without compromising general performance. Notably, these gains come without additional computational overhead, showing strong potential for extracting domain-specialized compressed models from general-purpose LMs.
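
To make the idea of combining a domain-specific and a general Hessian more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: this summary does not give the exact formula, so the additive blend `h_domain + alpha * h_general`, the parameter name `alpha`, the helper names, and the diagonal saliency score `w_ij^2 * H_jj` are all hypothetical choices. The layer-wise Hessian is approximated by the Gram matrix of calibration activations, as is common in Hessian-based pruning.

```python
import numpy as np

def layer_hessian(acts):
    """Proxy layer-wise Hessian X^T X / n from activations of shape (n_tokens, d_in)."""
    return acts.T @ acts / acts.shape[0]

def blend_hessians(h_domain, h_general, alpha):
    """Hypothetical blend: the general Hessian acts as a regularizer weighted by alpha.

    The additive form is an assumption made for illustration only.
    """
    return h_domain + alpha * h_general

def prune_rows(weight, hessian, sparsity=0.5):
    """Zero the lowest-saliency weights in each output row.

    Saliency w_ij^2 * H_jj uses only the Hessian diagonal (a simplification).
    """
    saliency = weight ** 2 * np.diag(hessian)
    k = int(weight.shape[1] * sparsity)           # weights to drop per row
    drop = np.argsort(saliency, axis=1)[:, :k]    # indices of the k smallest scores
    pruned = weight.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

# Synthetic stand-ins for calibration activations and one layer's weights.
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
general_acts = rng.normal(size=(256, d_in))
domain_acts = rng.normal(size=(256, d_in)) * np.linspace(0.1, 2.0, d_in)
W = rng.normal(size=(d_out, d_in))

H = blend_hessians(layer_hessian(domain_acts), layer_hessian(general_acts), alpha=0.5)
W_pruned = prune_rows(W, H, sparsity=0.5)
```

With `alpha = 0` the sketch reduces to purely domain-specific calibration, and larger values give the general data more influence over which weights are kept. No retraining is involved: only calibration activations and the existing weights are used.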

Paper Structure

This paper contains 61 sections, 10 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Compressing large general-purpose LMs into smaller domain-specific models.
  • Figure 2: The Hessian at layer 16 of Mistral NeMo 12B, computed with (a) generic calibration data, and (b) domain-specific calibration data. For clarity, we present the magnitude of the elements for the first 32 features.
  • Figure 3: The average benchmark accuracy when pruning to 50% sparsity, relative to the original model.
  • Figure 4: Average accuracy when applying 4-bit quantization and 2:4 sparsity, relative to the dense model.
  • Figure 5: The average duration and peak memory allocated when pruning Llama 3.1 8B with each method, as measured using an NVIDIA A100 80GB GPU.
  • ...and 4 more figures