Multiple-policy Evaluation via Density Estimation

Yilei Chen; Aldo Pacchiano; Ioannis Ch. Paschalidis

TL;DR

This paper tackles offline evaluation of multiple policies by introducing CAESAR, a two-phase method that first builds coarse visitation estimators and then constructs an approximately optimal offline sampling distribution to jointly estimate policy values. The second phase estimates importance-weighting ratios via IDES, a step-wise extension of DualDICE tailored to finite-horizon MDPs, with high-probability guarantees achieved using a Median-of-Means approach. The main result provides a non-asymptotic, instance-dependent sample complexity that scales with the visitation overlaps of target policies, offering improvements over trajectory-stitching baselines and mitigating dependence on the number of policies $K$. Additional contributions include MARCH for coarse, scalable estimation across many deterministic policies and policy identification extensions, expanding the utility of coarse density estimation in offline RL. Overall, CAESAR advances practical, provable multi-policy evaluation by leveraging density estimation to design efficient behavior distributions and robust ratio estimators.
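The Median-of-Means step mentioned above can be illustrated with a minimal generic sketch. This is the textbook estimator, not the paper's exact procedure: the function name `median_of_means`, the block count, and the toy data are assumptions made for illustration only.

```python
import random
import statistics


def median_of_means(samples, num_blocks):
    """Generic Median-of-Means: average each block, then take the median.

    Hypothetical helper for illustration; samples left over when
    len(samples) is not divisible by num_blocks are ignored.
    """
    if num_blocks < 1 or num_blocks > len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


# Toy data (assumed): Gaussian noise around 1.0 plus one gross outlier.
rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(900)]
data[0] = 1e6  # corrupts the plain mean, but only one of the nine blocks

mom_estimate = median_of_means(data, num_blocks=9)
plain_mean = statistics.fmean(data)
```

The outlier shifts `plain_mean` by more than a thousand, while `mom_estimate` stays close to 1.0 because the median discards the single corrupted block. This robustness is what turns a constant-probability guarantee per block into a high-probability guarantee overall.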

Abstract

We study the multiple-policy evaluation problem, where we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy $ε$ with probability at least $1-δ$. We propose an algorithm named $\mathrm{CAESAR}$ for this problem. Our approach is based on computing an approximately optimal offline sampling distribution and using the data sampled from it to estimate the policy values simultaneously. $\mathrm{CAESAR}$ has two phases. In the first, we produce coarse estimates of the visitation distributions of the target policies at a low-order sample complexity rate that scales with $\tilde{O}(\frac{1}{ε})$. In the second phase, we approximate the optimal offline sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE \cite{nachum2019dualdice} objective. Up to low-order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity $\tilde{O}\left(\frac{H^4}{ε^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{π^k}(s,a))^2}{μ^*_h(s,a)}\right)$, where $d^π$ is the visitation distribution of policy $π$, $μ^*$ is the optimal sampling distribution, and $H$ is the horizon.
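To make the instance-dependent term $\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{π^k}(s,a))^2}{μ_h(s,a)}$ concrete, the sketch below evaluates it for small tabular inputs. The helper names `coverage_term` and `uniform_mixture`, the nested-list layout, and the toy distributions are assumptions for illustration. The uniform mixture of the target visitations is used only as one candidate sampling distribution and is not claimed to equal $μ^*$.

```python
def coverage_term(d, mu):
    """Evaluate sum_h max_k sum_{s,a} d[k][h][i]**2 / mu[h][i].

    d[k][h][i]: visitation probability of policy k at step h for the
                i-th (state, action) pair (assumed tabular layout).
    mu[h][i]:   sampling probability of the same pair at step h.
    Pairs with d = 0 contribute nothing; a pair with d > 0 and mu = 0
    makes the term infinite (no coverage).
    """
    total = 0.0
    for h in range(len(mu)):
        worst = 0.0
        for d_k in d:
            value = 0.0
            for p, m in zip(d_k[h], mu[h]):
                if p > 0.0:
                    if m <= 0.0:
                        return float("inf")
                    value += p * p / m
            worst = max(worst, value)
        total += worst
    return total


def uniform_mixture(d):
    """Candidate sampling distribution: the average of the K visitations."""
    num_policies = len(d)
    return [
        [sum(d[k][h][i] for k in range(num_policies)) / num_policies
         for i in range(len(d[0][h]))]
        for h in range(len(d[0]))
    ]


# Toy instances with H = 1 and two (state, action) pairs (assumed values).
disjoint = [[[1.0, 0.0]], [[0.0, 1.0]]]   # two policies, no overlap
identical = [[[1.0, 0.0]], [[1.0, 0.0]]]  # two policies, full overlap

term_disjoint = coverage_term(disjoint, uniform_mixture(disjoint))
term_identical = coverage_term(identical, uniform_mixture(identical))
```

With fully overlapping policies the term equals 1, the same as for a single policy, while the disjoint pair gives 2. This mirrors the claim that the bound scales with the overlap of the target visitations rather than directly with $K$.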

Paper Structure (32 sections, 20 theorems, 113 equations, 1 figure, 4 algorithms)

Introduction
Related Work
Preliminaries
Notations
Reinforcement learning framework
Multiple-policy evaluation problem setup
Main Results and Algorithm
Coarse estimation of visitation distributions
Approximately optimal sampling distribution
Estimation of importance-weighting ratios
Main results
Discussions
Comparison with existing result
Near-optimal policy identification
Conclusion and Future Work
...and 17 more sections

Key Result

Lemma 4.2

Let $Z_\ell$ be i.i.d. samples $Z_\ell\stackrel{i.i.d.}{\sim} \mathrm{Ber}(p)$. Setting $t\ge \frac{C\log(C/\epsilon\delta)}{\epsilon}$, for some known constant $C>0$, it follows that with probability at least $1-\delta$, the empirical mean estimator $\hat{p}_t = \frac{1}{t}\sum_{\ell=1}^t Z_\ell$ s

Figures (1)

Figure 1: The scheme of $\mathrm{CAESAR}$. In Phase I, we collect $\tilde{O}(1/\epsilon)$ trajectories for each target policy $\pi_1,\dots,\pi_K$ and obtain coarse estimators of their visitation distributions $\hat{d}^{\pi_1},\dots,\hat{d}^{\pi_K}$. Based on the coarse estimators, we generate an approximately optimal sampling dataset that has good coverage over the visitations of the target policies. In Phase II, we sample data from the approximately optimal dataset and leverage the coarse estimators from Phase I to perform importance density estimation for each target policy by running $\mathrm{IDES}$. With the estimated importance densities $\hat{w}^{\pi_1},\dots,\hat{w}^{\pi_K}$, we output the final performance estimates $\hat{V}^{\pi_1},\dots,\hat{V}^{\pi_K}$.
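The last step of the scheme, turning estimated importance densities into value estimates, can be sketched as a standard importance-weighted average. The exact estimator used by $\mathrm{CAESAR}$ is not given in this summary, so the form below, the function name `estimate_value`, the trajectory-shaped data layout, and the toy numbers are all assumptions for illustration.

```python
def estimate_value(samples, weights):
    """Importance-weighted value estimate (illustrative form, assumed).

    samples:  list of trajectories drawn from the sampling distribution;
              each trajectory is a list of (state, action, reward), one
              tuple per step h.
    weights:  weights[h][(state, action)] is the estimated ratio between
              the target policy's visitation and the sampling distribution
              at step h; missing pairs are treated as weight 0.
    """
    if not samples:
        raise ValueError("need at least one trajectory")
    total = 0.0
    for trajectory in samples:
        for h, (state, action, reward) in enumerate(trajectory):
            total += weights[h].get((state, action), 0.0) * reward
    return total / len(samples)


# Toy instance (assumed): H = 1, one state, two actions sampled uniformly.
# The target policy always takes action 0 (reward 1), so its ratio is 2
# on action 0 and 0 on action 1, and its true value is 1.
toy_weights = [{(0, 0): 2.0, (0, 1): 0.0}]
toy_samples = [[(0, 0, 1.0)], [(0, 1, 0.0)]]

v_hat = estimate_value(toy_samples, toy_weights)
```

Because the same sampled data is reused with a different weight table for each target policy, one dataset can serve all $K$ value estimates.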

Theorems & Definitions (34)

Definition 4.1: Coarse Estimator
Lemma 4.2
Proposition 4.3
Lemma 4.4
Lemma 4.5
Lemma 4.6
Lemma 4.7
Lemma 4.8
Theorem 4.9
Corollary 4.10
...and 24 more
