Incoherence in goal-conditioned autoregressive models

Jacek Karwowski, Raymond Douglas

TL;DR

This work analyzes incoherence in goal-conditioned autoregressive reinforcement learning, where a policy conditioned on an outcome can be inconsistent with its own future rollouts. It formalizes incoherence via soft Q/V functions and a KL-based measure, and shows that retraining on the agent's own actions decreases incoherence and improves return. The authors relate three ways of removing incoherence (retraining on the model's own trajectories, folding the posterior into the reward, and lowering the temperature), with the temperature correspondence holding in deterministic environments, and they connect this correspondence to the training-inference trade-off and the effective horizon. The results are theoretical; the summary points to language-model fine-tuning and RLHF-like techniques as potential applications.

Abstract

We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterize the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon.

Paper Structure

This paper contains 19 sections, 28 theorems, 104 equations, 4 figures.

Key Result

Proposition 4.2

For any prior $\pi(a|s)$ and any non-positive reward function $r(a,s)$, we have simple expressions for the soft $Q$ and $V$ functions given by: where we define the auxiliary optimality variables $\mathcal{O}_t$ to have Bernoulli distributions with:
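
To make the statement concrete, here is a minimal numerical sketch. It rests on assumptions that are not taken from the proposition itself: the usual control-as-inference convention $p(\mathcal{O}_t = 1 \mid s_t, a_t) = \exp(r(a_t, s_t))$ (a valid Bernoulli parameter precisely because $r \le 0$), the standard soft Bellman recursion with a prior, a deterministic two-step binary tree, and made-up leaf rewards. All names in the code are hypothetical. Under these assumptions the soft $V$ at the root should coincide with the log-probability that every optimality variable equals 1 when actions are drawn from the prior.

```python
import math
from itertools import product

# Hypothetical deterministic tree MDP: a state is the tuple of actions taken so far.
ACTIONS = ("up", "down")
HORIZON = 2
PRIOR = {a: 0.5 for a in ACTIONS}  # uniform prior pi(a|s), an assumption

# Made-up non-positive rewards, paid on the action that reaches a leaf.
LEAF_REWARD = {
    ("up", "up"): 0.0,
    ("up", "down"): -2.0,
    ("down", "up"): -1.0,
    ("down", "down"): -1.0,
}


def reward(state, action):
    """r(a, s): zero on intermediate steps, the leaf reward on the final step."""
    path = state + (action,)
    return LEAF_REWARD[path] if len(path) == HORIZON else 0.0


def soft_v(state):
    """Soft V: log of the prior-weighted average of exp(Q); zero at a leaf."""
    if len(state) == HORIZON:
        return 0.0
    return math.log(sum(PRIOR[a] * math.exp(soft_q(state, a)) for a in ACTIONS))


def soft_q(state, action):
    """Soft Q in a deterministic environment: reward plus soft V of the successor."""
    return reward(state, action) + soft_v(state + (action,))


def log_prob_optimal(state):
    """log p(all remaining O = 1 | state), enumerating rollouts of the prior,
    with p(O_t = 1 | s_t, a_t) = exp(r(a_t, s_t)) assumed."""
    total = 0.0
    for tail in product(ACTIONS, repeat=HORIZON - len(state)):
        prob, s = 1.0, state
        for a in tail:
            prob *= PRIOR[a] * math.exp(reward(s, a))
            s = s + (a,)
        total += prob
    return math.log(total)


def soft_policy(state):
    """Posterior policy pi(a|s) * exp(Q(s, a) - V(s)) over the actions."""
    v = soft_v(state)
    return {a: PRIOR[a] * math.exp(soft_q(state, a) - v) for a in ACTIONS}


root = ()
v_root = soft_v(root)
logp_root = log_prob_optimal(root)
policy_root = soft_policy(root)
```

In this sketch `v_root` and `logp_root` agree up to floating-point error, and `policy_root` is a proper distribution that shifts mass toward the branch with the higher-reward leaf. The agreement relies on determinism: with stochastic transitions, the soft Bellman backup and exact inference over optimality variables generally differ.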

Figures (4)

  • Figure 1: Tree representation of the MDP defining the mountain race (Example 1). States are represented as the tree nodes, and we put a uniform prior over actions (depicted as arrows $\nearrow, \searrow$). Rewards for each terminal state (tree leaf) are written on the right.
  • Figure 2: The fixed point of iterated $f$-coherence achieved after $T = 2$ iterations.
  • Figure 3: The fixed point of iterated $f$-coherence achieved after $t = 2 = T$ iterations.
  • Figure 4: $\pi$ cannot satisfy both 2-policy-stability and 1-policy-stability.

Theorems & Definitions (57)

  • Example 1: Mountain race
  • Definition 4.1: Soft $Q$ and $V$ functions
  • Proposition 4.2: Characterisation of $V$ and $Q$
  • Proposition 4.3
  • Proposition 4.3
  • Definition 4.4: Order‑respecting $f$
  • Definition 4.5: $f$-soft Q policy
  • Definition 4.6: $f$-incoherence
  • Proposition 4.7
  • Proposition 4.8
  • ...and 47 more