Primal-Dual Spectral Representation for Off-policy Evaluation

Yang Hu, Tianyi Chen, Na Li, Kai Wang, Bo Dai

TL;DR

The paper tackles off-policy evaluation in offline RL by removing the non-convex saddle-point optimization typical of DICE estimators through a novel primal-dual spectral representation of the transition operator. This representation yields linear, convex-concave formulations for the Q-function and stationary density-ratio, enabling SpectralDICE to efficiently leverage historical data with unknown behavior policies. A rigorous sample complexity guarantee accompanies the method, and empirical results on diverse benchmarks show faster convergence and lower OPE error compared with baselines like BestDICE, particularly in challenging settings. The work suggests promising future directions for offline policy optimization and imitation learning using the same primal-dual spectral framework.

Abstract

Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL): estimating the expected long-term payoff of a given target policy using only experiences from another, potentially unknown, behavior policy. The distribution correction estimation (DICE) family of estimators has advanced the state of the art in OPE by breaking the curse of horizon. However, the major bottleneck in applying DICE estimators lies in the difficulty of solving the saddle-point optimization involved, especially with neural network implementations. In this paper, we tackle this challenge by establishing a linear representation of the value function and the stationary distribution correction ratio, i.e., the primal and dual variables in the DICE framework, using the spectral decomposition of the transition operator. Such a primal-dual representation not only bypasses the non-convex non-concave optimization in vanilla DICE, thereby enabling a computationally efficient algorithm, but also paves the way for more efficient utilization of historical data. We highlight that our algorithm, SpectralDICE, is the first to leverage the linear representation of primal-dual variables in a way that is both computationally and sample efficient; its performance is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
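To make the roles of the primal variable (the $Q$-function) and the dual variable (the stationary distribution correction ratio) concrete, the sketch below checks on a small tabular problem that both give the same normalized policy value. It is not SpectralDICE itself: the transition matrix, rewards, and distributions are randomly generated, all names (`P`, `mu0`, `d_D`, `zeta`) are hypothetical, and the exact linear solves stand in for what an estimator would have to learn from data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.9  # n = number of (state, action) pairs in a toy problem

# Hypothetical target-policy transition matrix: P[i, j] = P^pi(j | i).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

r = rng.random(n)                 # reward for each (state, action) pair
mu0 = rng.random(n)
mu0 /= mu0.sum()                  # initial (state, action) distribution
d_D = rng.random(n)
d_D /= d_D.sum()                  # behavior data distribution (full support)

# Primal variable: Q solves the Bellman equation Q = r + gamma * P Q.
Q = np.linalg.solve(np.eye(n) - gamma * P, r)

# Dual variable: the discounted occupancy solves
# d = (1 - gamma) * mu0 + gamma * P^T d; the correction ratio is d / d_D.
d_pi = np.linalg.solve(np.eye(n) - gamma * P.T, (1 - gamma) * mu0)
zeta = d_pi / d_D

# Two expressions for the same normalized policy value.
value_primal = (1 - gamma) * (mu0 @ Q)
value_dual = np.sum(d_D * zeta * r)   # expectation of zeta * r under d_D

print(value_primal, value_dual)
```

In an actual OPE setting neither `P` nor `d_D` is known, which is why DICE-style methods estimate `Q` and `zeta` jointly through a saddle-point problem over samples.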

Paper Structure

This paper contains 14 sections, 2 theorems, 14 equations.

Key Result

Lemma 2

With the primal-dual spectral representation (eq:spectral_representation_P_pi), the $Q$-function $Q^{\pi}(\cdot, \cdot)$ is linearly representable in the primal feature space with cofactor $\bm{\theta}_Q^{\pi} \in \mathbb{R}^d$. Further, under assum:initial_distribution, the state-action occupancy measure $d^{\pi}(\cdot, \cdot)$ is also linearly representable in the dual feature space with cofactor …
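A finite, tabular analogue of this statement can be checked numerically. The sketch below assumes a factorization $P^{\pi} = \Phi M$ with primal features `Phi` and dual features `Mu`, a reward that is linear in `Phi`, and an initial distribution that lies in the span of `Mu`. These are illustrative assumptions chosen to mirror the lemma; the paper's exact parameterization and its displayed equations are not reproduced here, and every name in the code is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 8, 3, 0.9  # n (state, action) pairs, feature dimension d

def random_rows(rows, cols):
    """Random nonnegative matrix whose rows each sum to 1."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

Phi = random_rows(n, d)   # primal features, one row per (state, action) pair
Mu = random_rows(d, n)    # dual features, one column per next (state, action)
P = Phi @ Mu              # row-stochastic transition matrix of rank <= d

theta_r = rng.random(d)
r = Phi @ theta_r         # assumption: reward is linear in the primal features

omega_0 = rng.random(d)
omega_0 /= omega_0.sum()
mu0 = Mu.T @ omega_0      # assumption: initial distribution is in span of Mu

# Exact Q-function and discounted occupancy from their fixed-point equations.
Q = np.linalg.solve(np.eye(n) - gamma * P, r)
d_pi = np.linalg.solve(np.eye(n) - gamma * P.T, (1 - gamma) * mu0)

# Q = r + gamma * Phi (Mu Q)  =>  Q = Phi @ theta_Q
theta_Q = theta_r + gamma * (Mu @ Q)
# d = (1 - gamma) * Mu^T omega_0 + gamma * Mu^T (Phi^T d)  =>  d = Mu^T @ omega_d
omega_d = (1 - gamma) * omega_0 + gamma * (Phi.T @ d_pi)

print(np.allclose(Phi @ theta_Q, Q), np.allclose(Mu.T @ omega_d, d_pi))
```

Both unknowns therefore live in $\mathbb{R}^d$ rather than in a space indexed by all state-action pairs, which is the sense in which the representation makes the primal and dual variables linear.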

Theorems & Definitions (3)

  • Remark 1
  • Lemma 2
  • Corollary 3