STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Hossein Goli, Michael Gimelfarb, Nathan Samuel de Lara, Haruki Nishimura, Masha Itkina, Florian Shkurti

TL;DR

STITCH-OPE tackles off-policy evaluation in high-dimensional, long-horizon environments by combining a model-based diffusion framework with sub-trajectory stitching and negative guidance. It trains a diffusion model on behavior data to generate short sub-trajectories conditioned on a starting state, guides the denoising process with the target policy's score minus the behavior policy's score, and then stitches these pieces into full trajectories to estimate target-policy returns. The theoretical analysis provides bias and variance bounds showing an exponential reduction in variance with respect to the horizon when a fixed short window $w$ is used. Empirical results on the D4RL and OpenAI Gym benchmarks show better mean squared error, correlation, and regret metrics than baselines, including diffusion-policy variants. The approach scales to high-dimensional tasks and extends to target policies that are diffusion policies, and the paper gives practical guidance on tuning the sampling coefficients and the sub-trajectory length. The framework targets offline policy evaluation in domains where online interaction is costly or unsafe.

Abstract

Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE introduces two technical innovations that make it advantageous for OPE: (1) it prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) it generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that, under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.
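
The two ideas in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the guidance form `alpha * target_score - lam * behavior_score`, the function names, and the `sample_chunk` callback (a stand-in for the guided diffusion sampler) are all assumptions made for the example.

```python
def guidance(score_target, score_behavior, alpha, lam):
    """Guidance term alpha * target score - lam * behavior score (assumed form)."""
    return [alpha * t - lam * b for t, b in zip(score_target, score_behavior)]


def stitched_return(sample_chunk, reward_fn, s0, horizon, w, gamma):
    """Roll out `horizon` steps by chaining length-w chunks end-to-end.

    sample_chunk(state, w) is a hypothetical stand-in for the guided diffusion
    sampler: it returns w transitions (s, a, s_next) starting in `state`.
    """
    state, total, t, visited = s0, 0.0, 0, []
    while t < horizon:
        chunk = sample_chunk(state, w)
        for s, a, s_next in chunk[: horizon - t]:  # truncate the last chunk
            total += (gamma ** t) * reward_fn(s, a)
            visited.append(s)
            state = s_next  # the next chunk is conditioned on this state
            t += 1
    return total, visited


# Toy deterministic "sampler": always take action 1, so s_next = s + 1.
def toy_chunk(state, w):
    return [(state + i, 1, state + i + 1) for i in range(w)]


ret, states = stitched_return(
    toy_chunk, lambda s, a: 1.0, s0=0, horizon=10, w=4, gamma=0.5
)
```

In this toy run each chunk starts where the previous one ended, so the visited states form one unbroken trajectory. A value estimate in the spirit of the abstract would average `ret` over a batch of such stitched rollouts.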

Paper Structure

This paper contains 65 sections, 10 theorems, 113 equations, 11 figures, 11 tables, 2 algorithms.

Key Result

Theorem 3.3

Define $\hat{p}_\pi$ as the (length-$T$) trajectory distribution of the guided diffusion model, and $p_\pi$ as the true trajectory distribution under the target policy $\pi$. Under Assumptions a1 and a2, the mean squared error (MSE) of the STITCH-OPE return $\hat{J}$ satisfies: where $B_w = \frac{1 - \gamma^w}{1 - \gamma} \sup_{s,a}|R(s,a)|$ is a bound on the maximum length-$w$ discounted return,
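
The constant $B_w$ is a geometric sum: it equals $\sup_{s,a}|R(s,a)| \sum_{t=0}^{w-1} \gamma^t$. The short sketch below evaluates it for a few values; the function name and the numbers ($\gamma = 0.99$, $w = 8$ versus $w = 1000$, unit reward bound) are hypothetical and chosen only to show how the constant grows with the window length.

```python
def max_chunk_return(gamma, w, r_max):
    """B_w = (1 - gamma**w) / (1 - gamma) * r_max, valid for 0 <= gamma < 1."""
    return (1.0 - gamma ** w) / (1.0 - gamma) * r_max


# Hypothetical settings: a short window versus a full-length horizon.
b_short = max_chunk_return(gamma=0.99, w=8, r_max=1.0)
b_long = max_chunk_return(gamma=0.99, w=1000, r_max=1.0)
```

Here `b_short` is below 8 while `b_long` approaches the limit `r_max / (1 - gamma) = 100`, so a short window keeps $B_w$ small.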

Figures (11)

  • Figure 1: A conceptual illustration of STITCH-OPE, with novel contributions highlighted in orange. A: Behavior data is sliced into partial trajectories of length $w$. B: The data is fed to a conditional diffusion model that takes a $w$-length sequence of Gaussian noise $\epsilon$ and a state $s_{t}$ as inputs and applies the backward diffusion process to predict the behavior trajectory of length $w$ beginning in state $s_{t}$. C: To evaluate policies, STITCH-OPE also trains a neural network on the behavior transitions to predict the immediate reward. D: It then applies guided diffusion to the pretrained diffusion model to generate a batch of partial target trajectories of length $w$, where the guidance function incorporates the score functions of the target policy and the behavior policy. E: The guided partial trajectories are stitched end-to-end to produce full-length target trajectories. Finally, the guided trajectories are evaluated using the empirical reward function $\hat{R}(s,a)$, and averaged to estimate the value of the target policy.
  • Figure 2: Mean overall performance of all baselines, averaged across environments. Error bars represent +/- one standard error.
  • Figure 3: Pedagogical example illustrating guided diffusion sample generation for a Gaussian mixture $0.5 \mathcal{N}(1, 0.5^2) + 0.5 \mathcal{N}(-1, 0.5^2)$. Top row: histograms of samples from unguided backward diffusion at steps $k = 8, 6, 4, 0$, where $\nabla \log p(x)$ is the score of the Gaussian mixture shown in blue. Bottom row: histograms of samples from guided diffusion using the score function of a $\mathcal{N}(1, 0.5^2)$ distribution, i.e. $g(x) = -(x - 1) / 0.5^2$. The modified score function corresponding to the guided diffusion process is shown in blue. The guided score function (the score of the actual sampling density) is significantly shifted and skewed, relative to the original score function, at the intermediate denoising time steps ($k=6, 4$). This ensures that the right mode of the Gaussian mixture is sampled more frequently during denoising.
  • Figure 4: Illustration of the sub-trajectory decomposition. Each chunk $S_i$ generates a reward sequence $Y_i$ and leads to a boundary state $X_{i+1}$.
  • Figure 5: Smoothed performance landscape for Hopper. Left: Spearman correlation is largest around $\alpha \in [0.1, 0.5], \, \lambda \leq 0.5\alpha$. Right: The LogRMSE is smallest around $\alpha \in [0.01, 0.5],\, \lambda \in [0.25\alpha, 0.75\alpha]$. These results confirm the optimal range of $\lambda$ is $0 < \lambda < \alpha$.
  • ...and 6 more figures
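
The Gaussian-mixture example in the Figure 3 caption can be reproduced numerically at the clean-data level. The sketch below is an illustration under simplifying assumptions: it evaluates the scores of the noiseless densities only (the figure shows intermediate denoising steps, which would need the noised densities), and the function names are hypothetical.

```python
import math

SIGMA = 0.5
MODES = [(0.5, 1.0), (0.5, -1.0)]  # (weight, mean) pairs from the caption


def mixture_score(x):
    """d/dx log p(x) for p = 0.5 N(1, 0.5^2) + 0.5 N(-1, 0.5^2)."""
    # Both components share the same sigma, so normalizing constants cancel.
    dens = [wgt * math.exp(-((x - mu) ** 2) / (2 * SIGMA ** 2)) for wgt, mu in MODES]
    grads = [d * (-(x - mu) / SIGMA ** 2) for d, (_, mu) in zip(dens, MODES)]
    return sum(grads) / sum(dens)


def guide(x):
    """Score of N(1, 0.5^2), i.e. g(x) = -(x - 1) / 0.5^2."""
    return -(x - 1.0) / SIGMA ** 2


def guided_score(x):
    """Mixture score plus the guidance term."""
    return mixture_score(x) + guide(x)
```

At the midpoint `x = 0` the mixture score vanishes by symmetry while `guide(0.0)` is positive, and at the left mode `x = -1` the guided score is strongly positive. Both effects push samples toward the right mode, which is consistent with the behavior described in the caption.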

Theorems & Definitions (22)

  • Theorem 3.3
  • Definition C.1: Entropy
  • Definition C.2: Conditional Entropy
  • Theorem C.3
  • proof
  • Definition D.1: Chunked Behavior Distributions
  • Definition D.2: Total Variation Distance
  • Lemma D.5
  • proof
  • Lemma D.6: Expectation Difference Bound via Total Variation
  • ...and 12 more