Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman
TL;DR
SAMI aligns pretrained LMs with behavioral principles without human preference labels by increasing the conditional mutual information $I(Y; C|X)$ between self-generated responses $Y$ and constitutions $C$ given prompts $X$. It formulates a tractable InfoNCE-based objective with a two-sided contrast (over responses and over constitutions) and trains with an Expert Iteration-style loop, plus regularization to prevent degeneration. Empirically, a SAMI-trained mistral-7b outperforms its base model on single-turn dialogue and summarization and surpasses an instruction-finetuned baseline (mistral-7b-instruct) on single-turn dialogue; the approach also scales to stronger models such as llama3-70b and generalizes to diverse, held-out principles. The results suggest that regularities already latent in base models can be exploited to follow constitutions without preference labels.
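To make the two-sided contrast concrete, here is a minimal sketch of a symmetric InfoNCE-style loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `log_softmax` and `two_sided_infonce_loss` are hypothetical, and the input `logp` is assumed to be a square matrix in which `logp[i][j]` holds the model's log-probability of response `j` under constitution `i` for one prompt, with response `i` having been sampled under constitution `i`. Details the summary does not specify (length normalization, the regularizer, batching over prompts) are omitted.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def two_sided_infonce_loss(logp):
    """Symmetric InfoNCE-style loss for one prompt (illustrative sketch).

    Assumption: logp[i][j] = log p(response_j | prompt, constitution_i),
    and response_i was generated under constitution_i, so the diagonal
    entries are the matched (positive) pairs.
    """
    n = len(logp)
    if n == 0 or any(len(row) != n for row in logp):
        raise ValueError("logp must be a non-empty square matrix")

    # Contrast over responses: for constitution i, how well does the model
    # single out its own response i among all n responses (row i)?
    row_terms = [log_softmax(list(logp[i]))[i] for i in range(n)]

    # Contrast over constitutions: for response j, how well does the model
    # single out its own constitution j among all n constitutions (column j)?
    col_terms = [log_softmax([logp[i][j] for i in range(n)])[j] for j in range(n)]

    # Average the two directions and negate, so lower is better.
    return -0.5 * (sum(row_terms) + sum(col_terms)) / n
```

When the scores carry no information about which response belongs to which constitution (all entries equal), the loss equals $\log n$; it approaches zero as the matched pairs dominate their rows and columns. In the standard InfoNCE reading, $\log n$ minus this loss serves as a lower-bound estimate of the mutual information, which is why decreasing the loss is used as a proxy for increasing $I(Y; C|X)$.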
Abstract
When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles. To avoid dependence on strong models for writing principles, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
