Language Models Resist Alignment: Evidence From Data Compression

Jiaming Ji; Kaile Wang; Tianyi Qiu; Boyuan Chen; Jiayi Zhou; Changye Li; Hantao Lou; Juntao Dai; Yunhuai Liu; Yaodong Yang

TL;DR

This paper identifies elasticity as a fundamental mechanism by which language models resist alignment, modeling alignment dynamics through data compression and a token-tree framework. It formalizes elasticity via inverse-alignment behavior and a Hooke's-law analogy, and empirically validates resistance and rebound across model sizes and pre-training data scales. The core contribution is both theoretical (compression-based derivations showing dataset-size–dependent changes) and empirical (demonstrating resistance, rebound, and their scaling with data and model size). The findings underscore the need for robust, data-aware alignment strategies and have implications for open-sourcing and long-term safety of LLMs.

Abstract

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weights and code are available at pku-lm-resist-alignment.github.io.

Paper Structure (47 sections, 5 theorems, 36 equations, 10 figures, 3 tables)

Introduction
Related Work
The Fragility of LLMs Alignment
What is Elasticity?
Preliminaries
Pre-training
Supervised Fine-tuning (SFT)
Lossless Compression
Compression and Prediction
The Compression Protocol of LLMs
The Formal Definition of Elasticity
Why Elasticity Affects Alignment?
Formal Derivation of Elasticity
Elasticity and Inverse Alignment
Elasticity and the Hooke's Law
...and 32 more sections

Key Result

Theorem 3.4

Consider a finite-parameter model $p_{\bm{\theta}}\left(\cdot\right)$ trained on dataset $\mathcal{D}$. The ideal code length $\mathcal{L}_{p_{\bm{\theta}}}\left({\bm{x}}\right)$ of a random response ${\bm{x}}$ compressed by $p_{\bm{\theta}}$ can be expressed in terms of $d$, the depth of the token tree $\mathcal{T}_{\mathcal{D}}$ after pruning under the compression protocol (Definition 3.3).
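
To make the notion of an ideal code length concrete, the sketch below uses the standard link between prediction and lossless compression: a symbol to which a model assigns probability $p$ can be encoded in about $-\log_2 p$ bits. This is a minimal illustration of that general relation and not the pruned-token-tree expression of Theorem 3.4; the function name and the probability values are hypothetical.

```python
import math


def ideal_code_length_bits(token_probs):
    """Return the ideal code length, in bits, of a token sequence.

    token_probs holds the probability the model assigned to each token
    that actually occurred (hypothetical values in this sketch).
    Each token costs -log2(p) bits under an ideal entropy coder.
    """
    return sum(-math.log2(p) for p in token_probs)


# A model that predicts the response well compresses it into fewer bits.
well_predicted = [0.5, 0.5, 0.25]          # 1 + 1 + 2 bits
poorly_predicted = [0.125, 0.125, 0.125]   # 3 + 3 + 3 bits

print(ideal_code_length_bits(well_predicted))    # 4.0
print(ideal_code_length_bits(poorly_predicted))  # 9.0
```

Under this view, a better predictive model of a dataset is also a better compressor of it, which is the relation the paper's compression-based analysis relies on.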

Figures (10)

Figure 1: The Elasticity of Language Models. After perturbation, the change in normalized compression rate ($\Delta\gamma_{p_{\bm{\theta}}}^{\mathcal{D}_i/\mathcal{D}}$) is inversely proportional to the dataset volume ($|\mathcal{D}_i|$), akin to the relationship between spring deformation ($\Delta l_i$) and stiffness ($k_i$) in coupled springs. We conjecture that elasticity causes language models to resist alignment, enabling the possibility of inverse alignment.
Figure 2: Experiment pipeline for validating resistance. We conceptualize resistance as: inverse alignment is easier than forward alignment.
Figure 3: Experiment pipeline for validating rebound. We conceptualize rebound as: the more positive the post-trained model's performance, the more negative it becomes after inverse fine-tuning.
Figure 4: Experimental results validating the existence of rebound (left: IMDb, right: BeaverTails). Within each sub-figure, the left panel shows Gemma-2B and the right panel shows Llama2-7B. Models trained with more positive data initially perform better but perform worse after fine-tuning with negative data.
Figure 5: Experimental results validating that rebound increases with model size (left: IMDb, right: BeaverTails). Each line covers the positive data volume settings used in Figure 4, with the shaded band denoting the standard deviation. As model size increases, the performance of the aligned model deteriorates more rapidly after fine-tuning with negative data.
...and 5 more figures
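
The spring analogy in the Figure 1 caption can be checked with a few lines of arithmetic. The sketch below assumes the springs are coupled in series (an assumption; the caption says only "coupled springs"), so every spring carries the same force and Hooke's law gives $\Delta l_i = F / k_i$. The function name and numeric values are hypothetical. In the analogy, stiffness $k_i$ plays the role of dataset volume $|\mathcal{D}_i|$ and deformation $\Delta l_i$ plays the role of the change in normalized compression rate.

```python
def series_spring_deformations(force, stiffnesses):
    """Deformation of each spring in a series chain under a shared force.

    Springs in series all carry the same force F, so Hooke's law
    F = k_i * dl_i gives dl_i = F / k_i for each spring.
    """
    return [force / k for k in stiffnesses]


# Hypothetical stiffness values; a larger k stands in for a larger dataset.
stiffnesses = [1.0, 2.0, 4.0]
deformations = series_spring_deformations(8.0, stiffnesses)

# The product k_i * dl_i is the same for every spring: inverse proportionality.
products = [k * dl for k, dl in zip(stiffnesses, deformations)]

print(deformations)  # [8.0, 4.0, 2.0]
print(products)      # [8.0, 8.0, 8.0]
```

The stiffest spring deforms least, which mirrors the caption's claim that the distribution backed by the larger dataset changes least under the same perturbation.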

Theorems & Definitions (16)

Definition 3.1: Token Tree $\mathcal{T}$
Definition 3.3: The Compression Protocol
Theorem 3.4: Ideal Code Length
Definition 3.5: Inverse Alignment
Definition 3.6: The Elasticity of LLMs
Definition 4.1: Normalized Compression Rate
Theorem 4.2: Elasticity of Language Models
Theorem A.2: Ideal Code Length
proof
Definition A.3: Mass Distribution in Token Tree
...and 6 more
