LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies

Ximan Sun, Xiang Cheng

TL;DR

LRT-Diffusion introduces a principled, inference-time risk-control mechanism for diffusion policies in offline RL by casting each denoising step as a likelihood-ratio test between a background head and a good head. A single calibrated threshold τ enforces a level-α gate, converting static guidance into evidence-driven updates without altering training. The method builds a two-head, IQL-based labeling scheme and proves Neyman–Pearson optimality for the hard gate under equal covariances, along with finite-sample calibration and stability guarantees. Empirically, on D4RL MuJoCo tasks, LRT-Diffusion achieves calibrated Type-I control and improved return–OOD trade-offs, and can be composed with a small Q-step for exploitation. Overall, the work offers a drop-in risk-control mechanism that can be tuned at inference time to manage distribution shift while maintaining performance.

Abstract

Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold τ is calibrated once under H0 to meet a user-specified Type-I level α. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard ε-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return–OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired α. Theoretically, we establish level-α calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance, especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.
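The gating and calibration described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes both heads are isotropic Gaussians with a shared per-step standard deviation and fixed means, takes τ as the empirical (1 − α) quantile of cumulative log-likelihood ratios simulated under H0, and uses hypothetical names (`step_llr`, `calibrate_tau`, `logistic_gate`, `gated_mean`) and a hypothetical sharpness parameter `beta`.

```python
import numpy as np

def step_llr(x, mu0, mu1, sigma):
    """Per-step log-likelihood ratio log N(x; mu1, s^2 I) - log N(x; mu0, s^2 I)."""
    d0 = np.sum((x - mu0) ** 2, axis=-1)
    d1 = np.sum((x - mu1) ** 2, axis=-1)
    return (d0 - d1) / (2.0 * sigma ** 2)

def calibrate_tau(h0_cum_llrs, alpha):
    """Threshold as the empirical (1 - alpha) quantile of cumulative LLRs under H0."""
    return float(np.quantile(h0_cum_llrs, 1.0 - alpha))

def logistic_gate(cum_llr, tau, beta=5.0):
    """Soft gate in (0, 1); tends to the hard test 1[cum_llr >= tau] as beta grows."""
    return 1.0 / (1.0 + np.exp(-beta * (np.asarray(cum_llr) - tau)))

def gated_mean(mu0, mu1, gate):
    """Interpolate from the unconditional mean toward the conditional mean."""
    return mu0 + np.asarray(gate)[..., None] * (mu1 - mu0)

# Toy setup (assumed values): T denoising steps, d-dimensional actions.
rng = np.random.default_rng(0)
T, d, sigma, alpha = 10, 3, 1.0, 0.1
mu0 = np.zeros(d)
mu1 = 0.3 * np.ones(d)

def simulate_h0_cum_llr(n):
    """Draw n chains from the H0 (unconditional) head and sum the step LLRs."""
    x = mu0 + sigma * rng.standard_normal((n, T, d))
    return step_llr(x, mu0, mu1, sigma).sum(axis=1)

tau = calibrate_tau(simulate_h0_cum_llr(20000), alpha)

# Realized Type-I rate on fresh H0 chains should land close to alpha.
fresh = simulate_h0_cum_llr(20000)
realized_type1 = float(np.mean(fresh >= tau))

# Gate the mean for the fresh chains using their accumulated evidence.
mu_gated = gated_mean(mu0, mu1, logistic_gate(fresh, tau))
```

Because the threshold is fitted once on H0 samples, changing the risk budget only requires recomputing the quantile, with no retraining of either head.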

Paper Structure

This paper contains 92 sections, 8 theorems, 80 equations, 2 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

Suppose the reverse chain conditioned on $s$, with two simple hypotheses, follows the two-head reverse model of Sec. sec:lrt-reverse. Then the Neyman–Pearson test that rejects $H_0$ when $\ell_{\mathrm{cum}}\ge\tau$ is uniformly most powerful among all level-$\alpha$ tests.
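To see what the test statistic looks like under the equal-covariance setting mentioned in the TL;DR, assume (as an illustration; the step-level notation $\mu_{0,t}$, $\mu_{1,t}$, $\Sigma_t$, $x_{t-1}$ is ours, not the paper's) that each reverse step is Gaussian with head-specific mean and a shared covariance. The quadratic terms in $x_{t-1}$ then cancel, and the per-step log-likelihood ratio is linear in the sample:

$$
\ell_t
= \log\frac{\mathcal{N}(x_{t-1};\,\mu_{1,t},\Sigma_t)}{\mathcal{N}(x_{t-1};\,\mu_{0,t},\Sigma_t)}
= (\mu_{1,t}-\mu_{0,t})^{\top}\Sigma_t^{-1}\Big(x_{t-1}-\tfrac{1}{2}(\mu_{0,t}+\mu_{1,t})\Big),
\qquad
\ell_{\mathrm{cum}}=\sum_{t}\ell_t .
$$

Under this assumption, rejecting $H_0$ when $\ell_{\mathrm{cum}}\ge\tau$ is a threshold on the accumulated likelihood ratio of the whole chain, which is the form of test the Neyman–Pearson lemma addresses.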

Figures (2)

  • Figure 1: Risk–performance and Pareto fronts across tasks. Left of each row: risk–performance curves versus target $\alpha$. We report return (top), realized Type-I (middle), and state-conditional OOD (bottom) for LRT (solid) and QG (dashed); the realized Type-I tracks the target (gray) within finite-sample DKW bands. Right of each row: Pareto fronts (OOD vs. return; color encodes $\alpha$). LRT shifts the frontier up-and-left relative to QG on tasks where off-support critic error dominates, yielding higher return at lower OOD for the same $\alpha$. Error bars denote standard errors over evaluation rollouts.
  • Figure 2: LLR traces. Prefix LLR across denoising steps on random states; lines are different trajectories. Data: hopper-medium-replay-v2.

Theorems & Definitions (15)

  • Proposition 1: Neyman–Pearson optimality
  • Lemma 1: Soft$\to$hard limit under logistic gate
  • Proposition 2: Calibrated semantics under the deployment sampler
  • Theorem 2: Calibration accuracy via DKW (Dvoretzky et al., 1956; Massart, 1990)
  • Lemma 3: Deterministic displacement bound
  • Proposition 3: Return comparison under offline errors
  • proof : Sketch
  • Proposition 4: From level-$\alpha$ to an OOD bound
  • proof
  • proof
  • ...and 5 more
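
The finite-sample DKW bands referenced in Theorem 2 and Figure 1 can be illustrated with the standard Dvoretzky–Kiefer–Wolfowitz inequality with Massart's constant: for n i.i.d. calibration samples, the empirical CDF deviates from the true CDF by at most sqrt(ln(2/δ) / (2n)) uniformly, with probability at least 1 − δ. The sketch below is a generic helper, not the paper's code; the function names and the default δ are assumptions, and the band ignores the small extra error from quantile discretization.

```python
import math

def dkw_epsilon(n, delta):
    """Uniform deviation bound for the empirical CDF (DKW with Massart's constant)."""
    if n <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def type1_band(alpha, n, delta=0.05):
    """Interval in which the realized Type-I rate should fall for a target alpha."""
    eps = dkw_epsilon(n, delta)
    return max(0.0, alpha - eps), min(1.0, alpha + eps)

# Example: 2000 calibration rollouts at target alpha = 0.1.
lo, hi = type1_band(alpha=0.1, n=2000)
```

Since the half-width shrinks as 1/sqrt(n), quadrupling the number of calibration samples halves the band.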