Unsupervised Elicitation of Language Models
Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, Jan Leike
TL;DR
The paper tackles the challenge of aligning pretrained LMs to complex tasks without relying on human supervision. It introduces Internal Coherence Maximization (ICM), an unsupervised algorithm that searches for a set of labels that are mutually predictable under the model and logically consistent, using simulated annealing as an approximate search procedure. Empirically, ICM matches golden-label performance on GSM8K and TruthfulQA, outperforms crowdsourced supervision on Alpaca, and, on a gender prediction task where the LM is superhuman, elicits that capability better than human labels do; it is also used to train a Claude 4 Sonnet-based assistant without human labels, which on average matches its counterpart trained on production-grade human labels. The work suggests unsupervised elicitation as a viable, scalable path for aligning frontier LMs with human intent, while acknowledging limitations such as the need for salient concepts and concerns about data contamination.
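To make the search described above concrete, here is a minimal toy sketch in Python of simulated annealing over binary labels. It is not the paper's implementation: the similarity-weighted vote used for mutual predictability, the `contradictory_pairs` constraint, the `alpha` weight, and the linear cooling schedule are all assumptions made for illustration, standing in for scores that would in practice come from the language model itself.

```python
import math
import random


def mutual_predictability(labels, sim):
    """Toy stand-in: sum of log-probabilities of each label given the others.

    Each item's label is "predicted" by a similarity-weighted vote over all
    other items. Smoothing keeps every probability strictly inside (0, 1).
    """
    total = 0.0
    n = len(labels)
    for i in range(n):
        weight_same, weight_all = 1.0, 2.0  # smoothing terms (assumption)
        for j in range(n):
            if j == i:
                continue
            weight_all += sim[i][j]
            if labels[j] == labels[i]:
                weight_same += sim[i][j]
        total += math.log(weight_same / weight_all)
    return total


def inconsistencies(labels, contradictory_pairs):
    """Count pairs of mutually contradictory items that share a label."""
    return sum(1 for i, j in contradictory_pairs if labels[i] == labels[j])


def coherence(labels, sim, contradictory_pairs, alpha=1.0):
    """Hypothetical objective: weighted predictability minus violations."""
    return (alpha * mutual_predictability(labels, sim)
            - inconsistencies(labels, contradictory_pairs))


def anneal(sim, contradictory_pairs, steps=2000, t0=2.0, seed=0):
    """Simulated annealing over label assignments; returns the best one seen."""
    rng = random.Random(seed)
    n = len(sim)
    labels = [rng.random() < 0.5 for _ in range(n)]
    current = coherence(labels, sim, contradictory_pairs)
    initial = current
    best_labels, best = list(labels), current
    for step in range(steps):
        # Linear cooling with a small floor to avoid division by zero.
        temperature = max(t0 * (1 - step / steps), 1e-3)
        i = rng.randrange(n)
        labels[i] = not labels[i]  # propose flipping one label
        proposed = coherence(labels, sim, contradictory_pairs)
        delta = proposed - current
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposed  # accept the proposal
            if current > best:
                best_labels, best = list(labels), current
        else:
            labels[i] = not labels[i]  # reject: undo the flip
    return best_labels, best, initial


# Six toy items described by one made-up feature each; nearby items are similar.
features = [0.10, 0.20, 0.15, 0.90, 0.80, 0.95]
sim = [[math.exp(-5 * abs(a - b)) for b in features] for a in features]
contradictory_pairs = [(0, 3), (1, 4)]  # e.g. a claim and its negation

best_labels, best, initial = anneal(sim, contradictory_pairs)
print(best_labels, round(best, 3), round(initial, 3))
```

Accepting some score-lowering flips while the temperature is high lets the search escape poor local optima; as the temperature falls it behaves more like greedy hill climbing. Because the toy objective is symmetric under flipping every label, it can only recover a coherent partition of the items, not which side is "true".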
Abstract
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden labels and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 4 Sonnet-based assistant. The resulting assistant matches its counterpart trained on production-grade human labels on average, with higher scores on chat and safety yet lower scores on math and coding.
