Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs

Alexander K. Lew, Tan Zhi-Xuan, Gabriel Grand, Vikash K. Mansinghka

TL;DR

This paper tackles the challenge of reliably constraining LLM outputs at inference time beyond prompting and fine-tuning. It proposes sequential Monte Carlo steering, recasting generation as posterior inference in Feynman-Kac Transformer models and replacing standard decoding with particle-based SMC. The key contributions are (i) the Feynman-Kac formulation for constrained generation, (ii) the SMC steering algorithm with shared Transformer caching and a without-replacement resampling strategy, and (iii) the LLaMPPL library for building language-model probabilistic programs and automating steering. The approach achieves comparable computational cost to beam search while enabling sampling from constrained posteriors and supports tasks such as hard constraints, infilling, and prompt intersection, with improved sample quality through better proposals. The work thus provides a scalable framework to control LLM outputs with probabilistic guarantees and modular task composition.

Abstract

Even after fine-tuning and reinforcement learning, large language models (LLMs) can be difficult, if not impossible, to control reliably with prompts alone. We propose a new inference-time approach to enforcing syntactic and semantic constraints on the outputs of LLMs, called sequential Monte Carlo (SMC) steering. The key idea is to specify language generation tasks as posterior inference problems in a class of discrete probabilistic sequence models, and replace standard decoding with sequential Monte Carlo inference. For a computational cost similar to that of beam search, SMC can steer LLMs to solve diverse tasks, including infilling, generation under syntactic constraints, and prompt intersection. To facilitate experimentation with SMC steering, we present a probabilistic programming library, LLaMPPL (https://github.com/probcomp/hfppl), for concisely specifying new generation tasks as language model probabilistic programs, and automating steering of LLaMA-family Transformers.
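
To make the idea of "replacing standard decoding with sequential Monte Carlo inference" concrete, here is a minimal sketch on a toy problem. It is not the LLaMPPL API and not the paper's exact algorithm: `toy_lm`, `potential`, `smc_steer`, the three-token vocabulary, and the banned-token constraint are all hypothetical stand-ins, and the sketch uses plain multinomial resampling at every step rather than the without-replacement strategy mentioned in the TL;DR. Each particle proposes a token from the (toy) language model, is weighted by a 0/1 potential that encodes the hard constraint, and the population is then resampled in proportion to those weights.

```python
import math
import random

VOCAB = ["a", "b", "c"]
BANNED = "b"  # illustrative hard constraint: never emit this token


def toy_lm(prefix):
    """Hypothetical stand-in for an LLM: next-token probabilities over VOCAB."""
    # Mildly prefer repeating the last token so the output depends on context.
    weights = [2.0 if prefix and prefix[-1] == tok else 1.0 for tok in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]


def potential(prefix, token):
    """Potential G_t: 1.0 if the new token satisfies the constraint, else 0.0."""
    return 0.0 if token == BANNED else 1.0


def smc_steer(num_particles, num_steps, rng):
    """Bootstrap-style SMC: propose from the LM, weight by G_t, resample."""
    particles = [[] for _ in range(num_particles)]
    log_z_hat = 0.0  # running estimate of log Z
    for _ in range(num_steps):
        extended, weights = [], []
        for prefix in particles:
            # Markov kernel M_t: extend the particle with a token from the LM.
            token = rng.choices(VOCAB, weights=toy_lm(prefix))[0]
            weights.append(potential(prefix, token))
            extended.append(prefix + [token])
        mean_w = sum(weights) / num_particles
        if mean_w == 0.0:
            # Every particle violated the constraint; the estimate of Z is 0.
            return [], float("-inf")
        log_z_hat += math.log(mean_w)
        # Multinomial resampling proportional to the weights (an assumption of
        # this sketch; zero-weight particles are never selected).
        resampled = rng.choices(extended, weights=weights, k=num_particles)
        particles = [list(p) for p in resampled]
    return particles, log_z_hat
```

Calling `smc_steer(50, 5, random.Random(0))` returns 50 five-token sequences, none containing the banned token, together with a `log_z_hat` value. In this toy model, where the potentials are 0 or 1 and the proposal is the language model itself, `log_z_hat` estimates the log-probability that an unconstrained sample satisfies the constraint.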

Paper Structure (7 sections, 11 equations, 4 figures, 1 algorithm)

Figures (4)

  • Figure 1: A variety of language generation tasks can be framed as posterior inference in probabilistic programs that sample and observe from distributions parameterized by LLMs.
  • Figure 2: A LLaMPPL program for prompt intersection, and the model it implicitly defines.
  • Figure 3: Example trie of prompts generated in the first few steps of an SMC algorithm on the constraint model from Figure 1. Our system maintains such a trie and at each node, caches next-token logits and layerwise key/value vectors for the token-in-context.
  • Figure 4: Results of SMC steering on the prompt intersection task from §\ref{sec:examples}, modified to emit EOS after one sentence. Left: We plot mean values of $\log \hat{Z}$ across 10 runs of SMC steering, with varying numbers of particles $N$, fixed expansion factor $K=3$, and the two Feynman-Kac models for prompt intersection given in §\ref{sec:examples}. In the first model, the Markov kernel $M_t$ proposes tokens according to only the first prompt ("My favorite writer is probably"), and the potential $G_t$ conditions on agreement from the second prompt. In the second model, the Markov kernel $M_t$ samples a locally optimal proposal distribution based on logits from both prompts, and $G_t$ serves as an importance weight. Right: Higher $\mathbb{E}[\log \hat{Z}]$ corresponds to qualitatively better samples. Indeed, by Jensen's inequality, $\mathbb{E}[\log \hat{Z}]$ is a lower bound on $\log Z$, and the gap is itself an upper bound on the KL divergence between SMC steering's sampling distribution and the true posterior $\mathbb{P}$.
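
The Jensen's-inequality step in the Figure 4 caption can be spelled out as follows. This relies on the standard fact that the SMC estimate $\hat{Z}$ is unbiased, $\mathbb{E}[\hat{Z}] = Z$, which the caption does not state explicitly. The symbol $q$ for SMC steering's sampling distribution is notation introduced here for illustration.

```latex
% Jensen's inequality (log is concave) plus unbiasedness of \hat{Z}:
\mathbb{E}[\log \hat{Z}] \;\le\; \log \mathbb{E}[\hat{Z}] \;=\; \log Z

% The caption's second claim: the gap bounds the KL divergence from above.
\mathrm{KL}\left(q \,\|\, \mathbb{P}\right) \;\le\; \log Z - \mathbb{E}[\log \hat{Z}]
```

Read together, the two lines explain why a higher $\mathbb{E}[\log \hat{Z}]$ is a useful diagnostic: as it approaches $\log Z$ from below, the upper bound on the divergence from the true posterior shrinks toward zero.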