Is Free Self-Alignment Possible?

Dyah Adila; Changho Shin; Yijing Zhang; Frederic Sala

TL;DR

AlignEZ is a training-free framework for aligning pretrained LMs: it harvests self-generated preference data and edits representations at inference time. By identifying an alignment subspace with SVD and editing embeddings along carefully filtered directions, it achieves substantial alignment gains across multiple tasks and model scales without external labels. The approach supports multi-objective control, accelerates expensive methods like DPO when ground-truth data is limited, and can enhance specialized reasoning capabilities, albeit with diminishing returns as self-generated data is scaled up. Overall, AlignEZ offers a practical, scalable path to pluralistic alignment and rapid model personalization, relying on signals already present from pretraining rather than costly fine-tuning. Theoretical results clarify how latent concepts are shifted by targeted edits, while extensive experiments demonstrate improvements across math, coding, reasoning, and safety-related tasks.

Abstract

Aligning pretrained language models (LMs) often requires large-scale preference data and substantial computational resources. These costs become even more prohibitive for multi-objective or pluralistic alignment. Is this truly necessary? Can we perform efficient alignment using only internal model capabilities, and without additional training? To answer this question, we propose AlignEZ, a novel approach that leverages (1) self-generated preference data and (2) representation editing to achieve cost-effective, efficient alignment. By operating directly on learned representations, AlignEZ independently targets different behavioral aspects without the overhead of traditional alignment methods. Our experiments reveal that this cost-efficient procedure improves performance across diverse tasks: up to 19.9% on general alignment and 1.9% on challenging mathematical reasoning tasks, even when starting from a strong base model. AlignEZ can also align models to multiple objectives simultaneously, granting fine-grained control over multiple preference axes. Finally, we show that AlignEZ can accelerate more expensive alignment procedures, such as DPO, even under limited availability of ground-truth preference data.
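
The two ingredients named above, a subspace estimated from self-generated embeddings and an inference-time edit, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single direction per concept, uses power iteration as a stand-in for a full SVD, and invents the helper names `top_direction` and `edit_embedding` and the `strength` parameter purely for illustration; the paper's sample-conditional estimation, direction filtering, and layer selection are not modeled.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def top_direction(rows, iters=200, seed=0):
    """Approximate the top right singular vector of `rows`.

    Power iteration on rows^T rows; a stand-in for a full SVD.
    The sign of the returned unit vector is arbitrary.
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]  # rows @ v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]  # rows^T @ proj
        v = normalize(w)
    return v


def edit_embedding(x, harm_dir, help_dir, strength=1.0):
    """Hypothetical edit: remove the harmful component, then boost the helpful one."""
    h = dot(x, harm_dir)
    x = [a - h * d for a, d in zip(x, harm_dir)]  # project out harm_dir
    g = dot(x, help_dir)
    return [a + strength * g * d for a, d in zip(x, help_dir)]  # amplify along help_dir


# Toy "self-generated" embeddings (made-up numbers, not model outputs).
harmful = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
helpful = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]

harm_dir = top_direction(harmful)
help_dir = top_direction(helpful)
edited = edit_embedding([1.0, 1.0, 1.0], harm_dir, help_dir)
```

In this toy setup the first coordinate of the input is removed and the second is doubled, while the third, which neither direction touches, is left unchanged. Both steps are invariant to the sign ambiguity of the estimated directions, which is why the sketch does not try to orient them.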

Paper Structure (63 sections, 6 theorems, 42 equations, 6 figures, 11 tables)

Introduction
$\textsc{AlignEZ}$: Cost-effective LM Alignment
Self-generated Preference Data
Identifying Alignment Subspace
Sample-conditional estimation of $\Theta_l^{align}$.
Alignment via Embedding Editing
Selecting Layers for Intervention.
Theoretical Analysis
Removing Harmful Component
Boosting Helpful Component
Experiments
Improving Pretrained Model Alignment
Setup.
Metrics.
Datasets.
...and 48 more sections

Key Result

Theorem 3.1

Under the noise model described above, the coefficient $\alpha^{harm}_{s, -}$ for the harmful concept $z_s$ satisfies

Figures (6)

Figure 1: Training with DPO (blue) in time-constrained scenarios permits using only a few samples and produces poor alignment even as sample size increases (x-axis). $\textsc{AlignEZ}$ (pink) achieves alignment gains even with limited time, as it is training free.
Figure 2: Left to right: (1) Prompt the model for helpful vs. harmful traits (top), then generate noisy preference pairs (bottom). (2) $\textsc{AlignEZ}$ identifies alignment-relevant subspaces using only this self-generated data. (3) Apply subspace-based representation editing at inference time. (4) Example outputs from $\textsc{AlignEZ}$ (top) vs. the base model (bottom).
Figure 3: $\textsc{AlignEZ}$ enables fine-grained control over different alignment axes, demonstrated through reward scores across different steering strengths. Diagonal patterns indicate successful independent control, while correlated preferences (helpful, harmless) show less independent control. Cosine similarity quantifies the average similarity between alignment vectors from different preference groups.
Figure 4: $\textsc{AlignEZ}$ achieves superior multi-preference control compared to prompted base and RLHF models.
Figure 5: DPO with 1% data + $\textsc{AlignEZ}$ matches the performance of DPO with 25% data (blue dashed line).
...and 1 more figure
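
The Figure 3 caption uses cosine similarity to quantify how related the alignment vectors of different preference groups are. A minimal sketch of that measurement follows, assuming each group is simply a list of vectors and that "average similarity" means the mean over all cross-group pairs; the function names and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def mean_cross_similarity(group_a, group_b):
    """Average cosine similarity over every (a, b) pair across the two groups."""
    sims = [cosine_similarity(a, b) for a in group_a for b in group_b]
    return sum(sims) / len(sims)


# Toy alignment vectors (made-up numbers): two preference axes in 2-D.
group_a = [[1.0, 0.0], [1.0, 0.0]]
group_b = [[0.0, 1.0], [1.0, 0.0]]
score = mean_cross_similarity(group_a, group_b)
```

A score near 0 would suggest nearly orthogonal directions, which is the regime where the caption reports independent control; a score near 1 would correspond to correlated preferences such as the helpful and harmless pair.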

Theorems & Definitions (10)

Theorem 3.1
Theorem 3.2
Theorem 3.1
proof
Theorem 3.2
proof
Theorem 3.3
proof
Theorem 3.4
proof
