Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

Jan-Philipp Fränken; Eric Zelikman; Rafael Rafailov; Kanishk Gandhi; Tobias Gerstenberg; Noah D. Goodman

TL;DR

SAMI aligns pretrained LMs to behavioral principles without human preference labels by increasing the conditional mutual information $I(Y; C|X)$ between self-generated responses $Y$ and constitutions $C$ given prompts $X$. It optimizes a tractable InfoNCE-based lower bound with a two-sided contrast (over responses and over constitutions), trained in an Expert Iteration–style loop with regularization to prevent degeneration. Empirically, SAMI improves alignment on dialogue tasks beyond the base model and matches or exceeds instruction-finetuned baselines on summarization; it also scales to stronger models such as llama3-70b and generalizes to diverse, held-out principles. These results suggest that latent regularities in base models can be exploited to follow constitutions without labels, across domains and model scales.
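
For reference, the quantity being optimized can be written out. The first line is the standard definition of conditional mutual information. The second is a generic InfoNCE-style lower bound over $N$ constitution-response pairs per prompt, shown only as an assumed illustrative form: the symbols $\pi$, $N$, $y_i$, and $c_i$ are introduced here for illustration, and the paper's exact objective may differ.

$$
I(Y; C \mid X) = \mathbb{E}_{x, c, y}\left[\log \frac{p(y \mid x, c)}{p(y \mid x)}\right]
$$

$$
I(Y; C \mid X) \ge \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} \log \frac{\pi(y_i \mid x, c_i)}{\frac{1}{N}\sum_{j=1}^{N} \pi(y_i \mid x, c_j)}\right]
$$

The denominator in the second line contrasts a response against alternative constitutions; a two-sided contrast would add the analogous term that contrasts a constitution against alternative responses.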

Abstract

When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles. To avoid dependence on strong models for writing principles, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.

Paper Structure (28 sections, 8 equations, 10 figures, 4 tables, 1 algorithm)


Introduction
Related Work
Self-Supervised Alignment with Mutual Information
Experiments and Results
Experiment 1: Dialogue
Experiment 2: Weak and Strong Principle Writer
Results: Weak Principle Writer
Results: Strong Principle Writer
Experiment 3: Scaling to Stronger Models and Diverse Principles
Limitations and Conclusion
Appendix
Broader Impacts
Derivation
Hyperparameters
PyTorch Implementation
...and 13 more sections

Figures (10)

Figure 1: SAMI Illustration. [a] A user instructs an LM (the "principle writer") to write a set of principles and their antitheses, from which we sample constitutions. [b] Constitutions are then paired with queries from a dataset to sample responses by prompting an LM (the target model for finetuning). [c] Constitutions and responses are used to create contrastive pairs from which we obtain the log probabilities of the generated responses under different constitutions. This setup allows us to maximize a lower bound on the conditional mutual information $I(y;c|x)$ between responses $y$ and constitutions $c$ given queries $x$. SAMI optimizes this bound by minimizing the row- and column-wise cross-entropy loss between the normalized log probabilities and an identity matrix.
Figure 2: Experiment 1: Dialogue (HH-RLHF). We finetune mistral-7b (weak model) in both panels using principles written with claude-opus (strong principle writer). [a] Left: Conditional MI lower bound at each iteration. The dashed line indicates the MI for mistral-7b-instruct as a reference. Right: Average sequence length at each iteration. The dashed line represents the sequence length of mistral-7b-instruct. [b] Left: Length-corrected win rates against base model (mistral-7b). Right: Length-corrected win rates against instruct model (mistral-7b-instruct). We include $0.5$ (chance) as a reference point for iteration $t=0$ when comparing to the base model. Error bars correspond to $\pm$ SEM across 250 data points for all panels.
Figure 3: Experiment 2: Summarization (TL;DR). Conditional MI and Sequence Length. [a] Left: Conditional MI lower bound at each iteration (TL;DR only) for finetuned mistral-7b and mixtral-8x7b for principles written by mistral-7b-instruct. The dashed line indicates the MI for mistral-7b-instruct. Right: Average sequence length for mistral-7b and mixtral-8x7b on the TL;DR dataset using principles written by mistral-7b-instruct. The dashed line represents the sequence length of mistral-7b-instruct. [b] Left: Conditional MI lower bound at each iteration, using the same settings as in [a] but with principles written by claude-opus. Right: Average sequence length, using the same settings as in the right panel of [a], but with principles written by claude-opus. Dashed lines correspond to MI and sequence lengths from the instruct version of a model. Error bars correspond to $\pm$ SEM across 250 data points for all panels.
Figure 4: Experiment 2: Summarization (TL;DR). Win Rates. [a] Left: Win rates against base models (mistral-7b, mixtral-8x7b) using principles written by mistral-7b-instruct, where each finetuned model is compared to its corresponding base model. Right: Win rates of finetuned mistral-7b and mixtral-8x7b models, both against the instruct model (mistral-7b-instruct), using principles written by mistral-7b-instruct. We include $0.5$ (chance) as a reference point for iteration $t=0$ when comparing to a base model. [b] Left: Win rates against base models, using the same settings as in [a] but with principles written by claude-opus. Right: Win rates of finetuned models against the instruct model, using the same settings as in the right panel of [a], but with principles written by claude-opus. Error bars correspond to $\pm$ SEM across 250 data points for all panels.
Figure 5: Experiment 3: Diverse Summarization Principles. Win rates of the finetuned llama3-70b model against the base model for principles used during training ("train") and held-out ("test") principles, with and without chain-of-thought (CoT) (see the appendix). Error bars correspond to $\pm$ SEM across 250 data points.
...and 5 more figures
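
The loss described in the Figure 1 caption (row- and column-wise cross-entropy between normalized log probabilities and an identity matrix) can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's PyTorch implementation. The function names, the matrix layout (`logprobs[i][j]` is the log probability of response `j` under constitution `i`, with matched pairs on the diagonal), and the `log(N) - loss` bound (the usual InfoNCE relation) are assumptions made for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def two_sided_contrastive_loss(logprobs):
    """Average of row-wise and column-wise cross-entropy against identity.

    logprobs is a square matrix (list of lists); entry [i][j] is assumed to
    hold the sequence log probability of response j under constitution i.
    """
    n = len(logprobs)
    if any(len(row) != n for row in logprobs):
        raise ValueError("logprobs must be a square matrix")

    row_loss = 0.0
    col_loss = 0.0
    for i in range(n):
        row = logprobs[i]                        # fixed constitution i
        col = [logprobs[k][i] for k in range(n)]  # fixed response i
        # Cross-entropy with a one-hot (identity) target is the negative
        # log-softmax of the diagonal entry.
        row_loss += logsumexp(row) - logprobs[i][i]
        col_loss += logsumexp(col) - logprobs[i][i]
    return (row_loss + col_loss) / (2 * n)


def mi_lower_bound(logprobs):
    """InfoNCE-style estimate: log(N) minus the contrastive loss (assumed)."""
    return math.log(len(logprobs)) - two_sided_contrastive_loss(logprobs)


# Hypothetical 2x2 example: each response is far more likely under its own
# constitution than under the other one.
example = [[-10.0, -60.0],
           [-60.0, -10.0]]
loss = two_sided_contrastive_loss(example)
bound = mi_lower_bound(example)
```

When every entry of the matrix is equal, responses carry no information about constitutions: the loss equals `log(N)` and the estimated bound is zero. As the diagonal dominates, the loss approaches zero and the bound approaches its ceiling of `log(N)`.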
