LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies

Ximan Sun; Xiang Cheng

TL;DR

LRT-Diffusion introduces a principled, inference-time risk-control mechanism for diffusion policies in offline RL by casting each denoising step as a likelihood-ratio test between a background head and a good head. A single calibrated threshold τ enforces a level-α gate, converting static guidance into evidence-driven updates without altering training. The method uses a two-head, IQL-based labeling scheme and proves Neyman–Pearson optimality for the hard gate under equal covariances, along with finite-sample calibration and stability guarantees. Empirically, on D4RL MuJoCo tasks, LRT-Diffusion achieves calibrated Type-I control and improved return–OOD trade-offs, and it can be composed with a small Q-step for exploitation. Overall, the work offers a drop-in risk-control mechanism that can be tuned at inference time to manage distribution shift while maintaining performance.

Abstract

Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold τ is calibrated once under H0 to meet a user-specified Type-I level α. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard ε-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired α. Theoretically, we establish level-α calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance, especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.
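The gating idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes isotropic Gaussian steps with equal variance for both heads, an empirical-quantile rule for choosing τ from accumulated log-likelihood ratios simulated under H0, and a linear interpolation between the two head means. The function names and the `temperature` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np

def gaussian_llr(x, mu0, mu1, sigma):
    """Per-step log-likelihood ratio log N(x; mu1, s^2 I) - log N(x; mu0, s^2 I).

    Assumes both heads share the same isotropic variance sigma**2.
    """
    d0 = np.sum((x - mu0) ** 2, axis=-1)
    d1 = np.sum((x - mu1) ** 2, axis=-1)
    return (d0 - d1) / (2.0 * sigma ** 2)

def calibrate_tau(h0_totals, alpha):
    """Pick tau so that at most a fraction alpha of the H0 totals exceed it.

    h0_totals: accumulated log-likelihood ratios from rollouts drawn under H0.
    Uses the ceil((1 - alpha) * n)-th order statistic (an assumed rule).
    """
    s = np.sort(np.asarray(h0_totals, dtype=float))
    k = int(np.ceil((1.0 - alpha) * len(s))) - 1
    k = min(max(k, 0), len(s) - 1)
    return float(s[k])

def logistic_gate(cum_llr, tau, temperature=1.0):
    """Soft gate in (0, 1): near 0 below tau, near 1 well above it."""
    z = np.clip((np.asarray(cum_llr, dtype=float) - tau) / temperature, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-z))

def gated_mean(mu0, mu1, cum_llr, tau, temperature=1.0):
    """Interpolate from the unconditional mean mu0 toward the conditional mean mu1."""
    g = np.asarray(logistic_gate(cum_llr, tau, temperature))[..., None]
    return mu0 + g * (mu1 - mu0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, steps, dim, sigma, alpha = 2000, 10, 3, 1.0, 0.1
    mu0, mu1 = np.zeros(dim), 0.5 * np.ones(dim)
    # Simulate H0 rollouts: every step is drawn around the unconditional mean.
    x = mu0 + sigma * rng.standard_normal((n, steps, dim))
    totals = gaussian_llr(x, mu0, mu1, sigma).sum(axis=1)
    tau = calibrate_tau(totals, alpha)
    print("tau:", tau, "empirical Type-I rate:", np.mean(totals > tau))
```

On the calibration sample itself, the fraction of H0 totals exceeding τ is at most α by construction. How well that carries over to fresh samples depends on the sample size, which is what a finite-sample calibration guarantee would quantify.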
