Is Free Self-Alignment Possible?
Dyah Adila, Changho Shin, Yijing Zhang, Frederic Sala
TL;DR
AlignEZ is a training-free framework for aligning pretrained LMs by harvesting self-generated preference data and performing inference-time representation editing. By identifying an alignment subspace with SVD and editing embeddings along carefully filtered directions, it achieves substantial alignment gains across multiple tasks and model scales without external labels. The approach supports multi-objective control, accelerates expensive methods like DPO with limited ground-truth data, and can enhance specialized reasoning capabilities, albeit with diminishing returns when scaling self-generated data. Overall, AlignEZ offers a practical, scalable path to pluralistic alignment and rapid model personalization, leveraging intrinsic pretraining signals rather than costly fine-tuning. Theoretical results clarify how latent concepts are shifted by targeted edits, while extensive experiments demonstrate robust improvements across math, coding, reasoning, and safety-related tasks.
Abstract
Aligning pretrained language models (LMs) often requires large-scale preference data and substantial computational resources. These costs become even more prohibitive for multi-objective or pluralistic alignment. Is this truly necessary? Can we perform efficient alignment using only internal model capabilities, and without additional training? To answer this question, we propose AlignEZ, a novel approach that leverages (1) self-generated preference data and (2) representation editing to achieve cost-effective, efficient alignment. By operating directly on learned representations, AlignEZ independently targets different behavioral aspects without the overhead of traditional alignment methods. Our experiments reveal that this cost-efficient procedure improves performance across diverse tasks: up to 19.9% on general alignment and 1.9% on challenging mathematical reasoning tasks, even when starting from a strong base model. AlignEZ can also align models to multiple objectives simultaneously, granting fine-grained control over multiple preference axes. Finally, we show that AlignEZ can accelerate more expensive alignment procedures, such as DPO, even under limited availability of ground-truth preference data.
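To make the two ingredients above concrete, the following is a minimal sketch of SVD-based subspace identification followed by an inference-time edit of a hidden representation. It is an illustration under stated assumptions, not the authors' implementation: the function names `alignment_subspace` and `edit_hidden_state`, the parameters `k` and `alpha`, the simple additive edit rule, and the synthetic embeddings standing in for self-generated preference pairs are all hypothetical, and the direction filtering mentioned in the TL;DR is not modeled.

```python
import numpy as np


def alignment_subspace(preferred, dispreferred, k=1):
    """Estimate k orthonormal directions from paired embedding differences.

    preferred, dispreferred: arrays of shape (n_pairs, d).
    Returns an array of shape (k, d) whose rows are unit vectors.
    """
    diffs = preferred - dispreferred
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    basis = vt[:k]
    # SVD signs are arbitrary; orient each row toward the mean difference.
    signs = np.sign(basis @ diffs.mean(axis=0))
    signs[signs == 0] = 1.0
    return basis * signs[:, None]


def edit_hidden_state(h, basis, alpha=1.0):
    """Shift h by alpha along each direction in basis (assumed edit rule)."""
    return h + alpha * basis.sum(axis=0)


# Synthetic stand-in for embeddings of self-generated preference pairs:
# preferred samples are offset along one hidden "true" direction.
rng = np.random.default_rng(0)
n_pairs, d = 50, 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
dispreferred = 0.1 * rng.standard_normal((n_pairs, d))
preferred = 0.1 * rng.standard_normal((n_pairs, d)) + 2.0 * true_dir

basis = alignment_subspace(preferred, dispreferred, k=1)
h = rng.standard_normal(d)            # a hidden state at inference time
edited = edit_hidden_state(h, basis, alpha=0.5)

print("cosine with true direction:", float(basis[0] @ true_dir))
print("shift along learned direction:", float((edited - h) @ basis[0]))
```

In this toy setting no weights are updated: the only computation is one SVD over the difference matrix plus a vector addition per edited representation, which is the sense in which such an approach is training-free. Using several directions (`k > 1`), or separate bases per preference axis, would be one way to sketch multi-objective control.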
