
Learning Tractable Distributions Of Language Model Continuations

Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck, Benjie Wang

TL;DR

LTLA addresses the challenge of conditioning language models on sequence-level constraints by coupling a neural encoder with a fixed, tractable HMM that performs lookahead. The approach yields more accurate continuation distributions and better constraint satisfaction for both language and vision–language tasks, with modest decoding overhead. By encoding prefix information into the surrogate prior $p(z_t|x_{1:t})$ and keeping the HMM decoder fixed, LTLA reuses computations across candidate next tokens and prefixes, enabling scalable, exact queries $p(\alpha|x_{1:t})$. Empirically, LTLA outperforms unconditional HMM surrogates on predictive likelihood and improves constrained generation on benchmarks that include hard DFA constraints and soft toxicity constraints in VLMs.

Abstract

Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
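The reuse argument in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the toy constraint ("a banned token does not appear in the next few tokens"), the linear-softmax encoder head `latent_prior`, the random matrices, and all names are assumptions made for illustration. The point it shows is that the lookahead vector `w` depends only on the fixed HMM, so it is computed once, while each prefix contributes only a prior over latent states.

```python
import numpy as np

def avoid_token_lookahead(A, B, banned, horizon):
    """w[z] = P(none of the next `horizon` tokens equals `banned` | latent state z).

    A[z, z2] = p(z2 | z) and B[z, x] = p(x | z) are the fixed HMM decoder.
    The result does not depend on any prefix, so it can be cached.
    """
    ok = 1.0 - B[:, banned]      # P(the emitted token is allowed | z)
    w = ok.copy()                # horizon of one token
    for _ in range(horizon - 1):
        w = ok * (A @ w)         # emit an allowed token, then transition
    return w

def latent_prior(hidden, W):
    """Hypothetical encoder head: softmax over a linear map of LM hidden states."""
    logits = hidden @ W
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
H, V, D = 4, 6, 8                        # latent states, vocabulary, hidden size
A = rng.dirichlet(np.ones(H), size=H)    # transition matrix, rows sum to 1
B = rng.dirichlet(np.ones(V), size=H)    # emission matrix, rows sum to 1
W = rng.normal(size=(D, H))              # stand-in for learned encoder weights

# Computed once from the fixed decoder and shared by every prefix.
w = avoid_token_lookahead(A, B, banned=2, horizon=5)

# Stand-ins for the LM hidden states of three different prefixes.
hiddens = rng.normal(size=(3, D))
priors = latent_prior(hiddens, W)        # shape (3, H)

# P(constraint holds over the next 5 tokens | prefix), one value per prefix.
probs = priors @ w                       # shape (3,)

# All next-token candidates in one batched product:
# joint[v] = P(next token = v, following 5 tokens avoid `banned` | prefix 0).
joint = (priors[0] * (A @ w)) @ B        # shape (V,)
```

Under these assumptions, changing the prefix only changes `priors`, and the product with `B` scores every candidate next token at once instead of rescoring the prefix per candidate.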

Paper Structure

This paper contains 22 sections, 2 theorems, 11 equations, 4 figures, 4 tables.

Key Result

Proposition 1

For any Markov chain $X_{< t} \to Z_t\to X_{\ge t}$, we have

Figures (4)

  • Figure 1: The encoder given by standard HMMs is often insensitive to information contained in the context. Here the context is "they fired the <x> after just one", where <x> is either coach or employee. Under the standard HMM the predicted distribution is almost identical for the two contexts, while the neural HMM shows a significant shift (in particular, season and game become more likely when <x> = coach).
  • Figure 2: Neural architectures for neural-encoded HMMs: (a) frozen Transformer with linear mapping, (b) frozen Transformer with additional learnable layer, (c) fully finetuned Transformer.
  • Figure 3: Perplexity of neural-encoded HMMs and the baseline HMM for varying hidden sizes. Left: dense transition and emission matrices. Right: Monarch matrices.
  • Figure 4: Perplexity of (Monarch) HMMs vs neural HMMs on the GPT2-large dataset for different continuation lengths.

Theorems & Definitions (4)

  • Example 1
  • Proposition 1
  • Proposition 2
  • Definition 1