Latent Principle Discovery for Language Model Self-Improvement

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

TL;DR

The paper tackles the challenge of automating the discovery of human-aligned behavioral attributes for language model self-improvement. It introduces STaPLe, a posterior-regularized Monte Carlo EM framework that mines latent principles from the LM itself, then compresses them via hierarchical clustering into interpretable constitutions and trains the model to invoke these principles during refinement. Across iterative cycles, STaPLe yields improvements on instruction-following benchmarks (MT-Bench, AlpacaEval, IFEval) for multiple 7–8B models and scales to larger models in auxiliary experiments, with clustering preserving performance while enhancing interpretability. The results demonstrate a viable path toward autonomous, principle-driven post-training recipes for continual LM improvement, while acknowledging limitations and the value of human-in-the-loop oversight for safety and alignment. Overall, the work highlights how latent reasoning traces can guide intrinsic self-correction and offer a scalable, interpretable alternative to static constitutions.

Abstract

When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes that guide model reasoning toward human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ a form of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains that our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.
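The alternation the abstract describes, bootstrapped over several iterations, can be pictured as a short outer loop. The sketch below is only an illustration under stated assumptions: `staple_loop`, `mine`, and `fine_tune` are hypothetical names rather than the authors' code, the model is an opaque object, and the clustering step that compresses principles into a constitution is left out.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Traj = TypeVar("Traj")


def staple_loop(
    model: Model,
    mine: Callable[[Model], List[Traj]],              # hypothetical E-step: principle discovery
    fine_tune: Callable[[Model, List[Traj]], Model],  # hypothetical M-step: principle learning
    iterations: int = 3,
) -> Tuple[Model, List[int]]:
    """Alternate on-policy mining and fine-tuning for a fixed number of rounds."""
    mined_per_round: List[int] = []
    for _ in range(iterations):
        # E-step: the current model produces its own trajectories.
        trajectories = mine(model)
        # M-step: the model is updated on those trajectories.
        model = fine_tune(model, trajectories)
        mined_per_round.append(len(trajectories))
    return model, mined_per_round
```

Because each round mines from the model produced by the previous round, the data stays on-policy, which is the property the bootstrapping relies on.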

Paper Structure

This paper contains 67 sections, 1 theorem, 40 equations, 10 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

Assume the setting of an input $x$, an initial model response $y^1 \sim \pi_{\theta}(\cdot \mid x)$, a latent principle $z \sim \pi_{\theta}(\cdot \mid x,y^1,y^G)$, and a refinement $y^2 \sim \pi_{\theta}(\cdot \mid x,y^1,z)$. Then, the EM gradient for the STaPLe algorithm is equivalent to the REINF

Figures (10)

  • Figure 1: We introduce Self-Taught Principle Learning (STaPLe). (Left) Our Monte Carlo EM algorithm alternates between on-policy discovery and learning of latent principles guiding self-correction behavior. The principles may also be clustered to a compressed set, yielding human-interpretable constitutions $\mathcal{C}_t$ and models trained to follow them $\mathcal{M}_t$. (Right) The STaPLe algorithm induces self-improvement in AlpacaEval win-rate over three iterations for all three language models.
  • Figure 2: The figure above depicts the principle discovery (E-step) phase. We sample an initial response $y^1$ on-policy, then "hint" with the gold response to elicit candidate principles $z_{(1:N)}$. Then, we sample critiques on the initial response (only used in rejection sampling, and not included in the fine-tuning trajectories), which we use to obtain principle-guided refined responses $y^2_{(1:N)}$. The best refined response $\hat{y}^2$ is selected based on similarity to the gold response. We save the resulting trajectory, which is used for supervised fine-tuning in the principle learning (M-step) stage.
  • Figure 3: Principle discovery rates of the STaPLe algorithm in the unconstrained (left) and constrained (right) settings. This represents the fraction of the trajectories saved from the principle discovery process (E-step) that contain a unique principle label that was unseen in previous iterations.
  • Figure 4: Visualization of Table \ref{table:prometheus-stepwise}, comparing against the 50% baseline. While the win-rate exceeds 50%, the model continues to self-improve.
  • Figure 5: STaPLe refinement rates across 4 iterations for the unconstrained STaPLe algorithm. This represents the fraction of samples in the mining corpus on which at least one principle-conditioned refinement attempt improved over the initial response.
  • ...and 5 more figures
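
The principle discovery phase described in the Figure 2 caption can be sketched for a single prompt as follows. This is a minimal illustration, not the authors' implementation: the callables `sample_response`, `propose_principles`, and `refine` are hypothetical stand-ins for LM calls, `token_overlap` is a toy similarity chosen only to keep the example self-contained, the critique step is omitted, and rejecting trajectories that do not improve on the initial response is an assumption about how the rejection sampling works.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    prompt: str
    initial: str    # y^1, sampled on-policy
    principle: str  # z, the selected latent principle
    refined: str    # selected refinement \hat{y}^2


def token_overlap(a: str, b: str) -> float:
    """Toy similarity (an assumption): Jaccard overlap of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def e_step(
    prompt: str,
    gold: str,
    sample_response: Callable[[str], str],                          # hypothetical LM call
    propose_principles: Callable[[str, str, str, int], List[str]],  # hypothetical, sees gold as a hint
    refine: Callable[[str, str, str], str],                         # hypothetical LM call
    similarity: Callable[[str, str], float] = token_overlap,
    n: int = 4,
) -> Optional[Trajectory]:
    # 1. Sample the initial response on-policy.
    y1 = sample_response(prompt)
    # 2. Elicit N candidate principles, hinting with the gold response.
    principles = propose_principles(prompt, y1, gold, n)
    # 3. Produce one principle-guided refinement per candidate.
    candidates = [(z, refine(prompt, y1, z)) for z in principles]
    if not candidates:
        return None
    # 4. Select the refinement most similar to the gold response.
    best_z, best_y2 = max(candidates, key=lambda c: similarity(c[1], gold))
    # 5. Assumed rejection rule: keep only refinements that beat the initial response.
    if similarity(best_y2, gold) <= similarity(y1, gold):
        return None
    return Trajectory(prompt, y1, best_z, best_y2)
```

Under this assumed rejection rule, the share of prompts for which `e_step` returns a trajectory rather than `None` plays the role of the refinement rate reported in Figure 5.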

Theorems & Definitions (2)

  • Theorem 1: Equivalence of EM and Self-Play Gradients
  • proof