Primal-Dual Spectral Representation for Off-policy Evaluation
Yang Hu, Tianyi Chen, Na Li, Kai Wang, Bo Dai
TL;DR
The paper tackles off-policy evaluation in offline RL by removing the non-convex saddle-point optimization typical of DICE estimators through a novel primal-dual spectral representation of the transition operator. This representation makes the Q-function and the stationary density ratio linear in learned features, so the DICE saddle-point problem becomes convex-concave, and SpectralDICE can efficiently leverage historical data collected by unknown behavior policies. A rigorous sample complexity guarantee accompanies the method, and empirical results on diverse benchmarks show faster convergence and lower OPE error than baselines such as BestDICE, particularly in challenging settings. The work suggests future directions in offline policy optimization and imitation learning using the same primal-dual spectral framework.
Abstract
Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL): estimating the expected long-term payoff of a given target policy using only experience collected by another, potentially unknown, behavior policy. The distribution correction estimation (DICE) family of estimators has advanced the state of the art in OPE by breaking the curse of horizon. However, the major bottleneck in applying DICE estimators is the difficulty of solving the saddle-point optimization involved, especially with neural network implementations. In this paper, we tackle this challenge by establishing a linear representation of the value function and the stationary distribution correction ratio, i.e., the primal and dual variables in the DICE framework, using the spectral decomposition of the transition operator. This primal-dual representation not only bypasses the non-convex non-concave optimization in vanilla DICE, thereby enabling a computationally efficient algorithm, but also paves the way for more efficient utilization of historical data. We highlight that our algorithm, SpectralDICE, is the first to leverage a linear representation of the primal-dual variables that is both computationally and sample efficient; its performance is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
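
To illustrate why a spectral decomposition of the transition operator can make the primal and dual variables linear, the following is a minimal sketch under the standard low-rank (linear MDP) assumption. It is not necessarily the exact formulation used in the paper. The symbols \phi, \mu, \theta_r, w^\pi, \mu_0 and d^\pi are introduced here for illustration only, and the reward is assumed to be linear in \phi.

```latex
% Assumption (illustrative): the transition kernel and reward factorize as
\begin{align*}
P(s' \mid s, a) &= \langle \phi(s,a),\, \mu(s') \rangle, &
r(s,a) &= \langle \phi(s,a),\, \theta_r \rangle .
\end{align*}

% Primal side: the Bellman equation then gives a Q-function linear in \phi,
% with V^\pi(s') = \mathbb{E}_{a' \sim \pi(\cdot \mid s')} [ Q^\pi(s',a') ].
\begin{align*}
Q^\pi(s,a)
  &= r(s,a) + \gamma \int P(s' \mid s,a)\, V^\pi(s')\, ds' \\
  &= \Big\langle \phi(s,a),\; \theta_r + \gamma \int \mu(s')\, V^\pi(s')\, ds' \Big\rangle
   = \langle \phi(s,a),\, w^\pi \rangle .
\end{align*}

% Dual side: the discounted state occupancy d^\pi satisfies the flow equation,
% where \mu_0 is the initial-state distribution.
\begin{align*}
d^\pi(s')
  &= (1-\gamma)\, \mu_0(s') + \gamma \iint P(s' \mid s,a)\, d^\pi(s,a)\, ds\, da \\
  &= (1-\gamma)\, \mu_0(s')
     + \gamma \Big\langle \mu(s'),\; \iint \phi(s,a)\, d^\pi(s,a)\, ds\, da \Big\rangle .
\end{align*}
```

Under these assumptions, Q^\pi is determined by a finite-dimensional vector w^\pi, and the occupancy is linear in \mu up to the initial-state term. How the paper converts the second identity into a linear parameterization of the distribution correction ratio is not stated in the abstract, so that step is left open here.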
