Likelihood hacking in probabilistic program synthesis

Jacek Karwowski; Younesse Kaddar; Zihuiwen Ye; Nikolay Malkin; Sam Staton

Likelihood hacking in probabilistic program synthesis

Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton

Abstract

When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement $\mathcal{L}_{\text{safe}}$'s conditions as $\texttt{SafeStan}$, a LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.

Likelihood hacking in probabilistic program synthesis

Abstract

satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement

's conditions as

, a LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.

Paper Structure (43 sections, 8 theorems, 38 equations, 9 figures, 1 table)

This paper contains 43 sections, 8 theorems, 38 equations, 9 figures, 1 table.

Introduction
Related work
Reward hacking and objective misspecification.
AI scientist framing.
Probabilistic programming language design.
LLMs with PPLs and automated modeling.
Probabilistic circuits.
Probabilistic program synthesis and likelihood hacking
Probabilistic programming
Probabilistic program semantics
Setting for program synthesis
Likelihood hacking
Empirical demonstration
Experimental setup
Emergence of likelihood hacking
...and 28 more sections

Key Result

Theorem 1

If $\Gamma;\Delta \vdash_{\text{s}} p : P(T)$ then $p$ does not likelihood hack, that is, for every choice of $\rho_\Gamma \in \llbracket \Gamma \rrbracket$, we have:

Figures (9)

Figure 1: Training loop for probabilistic program synthesis. A program generator proposes candidate programs conditioned on a natural-language prompt and the training dataset. An inference engine evaluates each candidate by computing its negative log-likelihood (loss) on test data, according to the program itself. Note that programs can likelihood hack, e.g., by improperly scoring the input (red program).
Figure 2: Example programs in $\mathcal{L}_{\text{unsafe}}$. We fix a simple interface with one covariate $\Gamma \overset{\Delta}{=} (x:\mathbb{R})$, one datapoint $\Delta \overset{\Delta}{=} (\mathtt{y}:\mathbb{R})$, and output $P(\mathbb{R})$. All four programs are well-typed in $\mathcal{L}_{\text{unsafe}}$, but only (a) corresponds to a properly normalised probabilistic model over the interface. Programs (b)--(d) illustrate distinct likelihood-hacking mechanisms enabled by $\mathcal{L}_{\text{unsafe}}$.
Figure 3: Four PyMC programs generated during GRPO training. (a) Conjugate Beta--Bernoulli; normalises to $M=1$. (b) The pm.Potential labelled "regularisation" disguises score injection as regularisation. (c) Compound exploit: data['y'] is observed twice (Beta + Bernoulli) and score is injected via Potential. (d) An exploit that conditions deterministic computation of the model parameter on the observation. Programs (b), (c), and (d) are flagged by the normalisation check (\ref{['sec:setup']}). More examples are listed in \ref{['app:exploit-catalog']}.
Figure 4: Typing rules for pure terms in $\mathcal{L}_{\text{unsafe}}$.
Figure 5: Typing rules for distributions in $\mathcal{L}_{\text{unsafe}}$.
...and 4 more figures

Theorems & Definitions (31)

Definition 1
Definition 2: Experimental setup
Definition 3: Likelihood hacking
Definition 4: LH-safety
Example 1: Non-likelihood hacking programs
Theorem 1: Soundness of $\mathcal{L}_{\text{safe}}$
proof
Corollary 1: Protocol safety for $\mathcal{L}_{\text{safe}}$
proof
Remark 1
...and 21 more

Likelihood hacking in probabilistic program synthesis

Abstract

Likelihood hacking in probabilistic program synthesis

Authors

Abstract

Table of Contents

Key Result

Figures (9)

Theorems & Definitions (31)