Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

Riccardo Poiani, Nicole Nobili, Alberto Maria Metelli, Marcello Restelli

TL;DR

The results show that RIDO adapts its trajectory schedule toward timesteps where more sampling is needed to improve the quality of the final estimate. The analysis also suggests that adaptive data collection strategies, which spend the available budget sequentially, can allocate a larger share of transitions to the timesteps in which more accurate sampling is required to reduce the error of the final estimate.

Abstract

Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths, i.e., truncated. Specifically, this surrogate shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially can allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the error of the final estimate. Building on these findings, we present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO). The main intuition behind RIDO is to split the available interaction budget into mini-batches. At each round, the agent determines the most convenient schedule of trajectories that minimizes an empirical and robust version of the surrogate of the estimator's error. After discussing the theoretical properties of our method, we conclude by assessing its performance across multiple domains. Our results show that RIDO can adapt its trajectory schedule toward timesteps where more sampling is required to increase the quality of the final estimation.
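
As a rough illustration of what a return estimator over trajectories of different lengths could look like, the sketch below averages the rewards observed at each timestep over the trajectories that actually reached it, then sums the discounted averages. The estimator form, the names `truncated_mc_estimate` and `toy_rollout`, and the `rollout(length, rng)` interface are assumptions made for this example; they are not taken from the paper, and the sketch does not implement RIDO's schedule optimization.

```python
import random


def truncated_mc_estimate(rollout, n, gamma, rng):
    """Discounted-return estimate from trajectories of different lengths.

    n[t] is the number of trajectories still running at timestep t. It must be
    non-increasing with n[-1] >= 1, so sum(n) is the number of transitions spent.
    rollout(length, rng) is a hypothetical simulator hook that returns the list
    of rewards collected by one trajectory truncated after `length` steps.
    """
    horizon = len(n)
    if horizon == 0 or n[-1] < 1:
        raise ValueError("every timestep needs at least one sample")
    if any(n[t] < n[t + 1] for t in range(horizon - 1)):
        raise ValueError("n must be non-increasing")

    reward_sums = [0.0] * horizon
    for i in range(n[0]):
        # Trajectory i stays active at every timestep t with n[t] > i.
        length = sum(1 for t in range(horizon) if n[t] > i)
        for t, reward in enumerate(rollout(length, rng)):
            reward_sums[t] += reward

    # Per-timestep sample mean, weighted by the discount factor.
    return sum((gamma ** t) * reward_sums[t] / n[t] for t in range(horizon))


def toy_rollout(length, rng):
    """Hypothetical simulator: noisy rewards centered at 1.0 at every step."""
    return [rng.gauss(1.0, 0.1) for _ in range(length)]
```

Calling `truncated_mc_estimate(toy_rollout, [4, 4, 2, 1], 0.9, random.Random(0))` spends 11 transitions on four trajectories of lengths 4, 3, 2 and 2; a fixed-length schedule is the special case in which all entries of `n` are equal.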

Paper Structure

This paper contains 36 sections, 20 theorems, 145 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Fix $\epsilon > 0$ and consider an online algorithm such that Equation eq:consistency holds almost surely. Then, we have that:

Figures (6)

  • Figure 1: Visualization of the transformation between the optimization problems. The first row shows the objective function of the original optimization problem; the second row shows its transformation.
  • Figure 2: Empirical MSE (mean and $95$% confidence intervals over $100$ runs) on the considered domains and baselines. The first row uses higher values of $\gamma$ than the second row.
  • Figure 3: Ablations on different values of $\beta$ on Example 1 (top) and Example 2 (bottom). Left: empirical MSE (mean and $95$% confidence intervals over $100$ runs). Right: DCS visualization (mean and $95$% confidence intervals over $100$ runs) using $\Lambda = 10000$.
  • Figure 4: Ablations on different mini-batch sizes on Example 1 (top) and Example 2 (bottom). Left: empirical MSE (mean and $95$% confidence intervals over $100$ runs). Right: DCS visualization (mean and $95$% confidence intervals over $100$ runs) using $\Lambda = 1000$.
  • Figure 5: DCS visualization for Pendulum, LQG, Navigation and Ant (mean and $95$% confidence intervals over $100$ runs). The $x$ axis reports the timestep $t$, while the $y$ axis reports $n_t$. For Pendulum, LQG and Navigation, we consider $\Lambda=5000$, while for Ant we consider $\Lambda=20000$.
  • ...and 1 more figure

Theorems & Definitions (40)

  • Example 1
  • Example 2
  • Theorem 3.1
  • Theorem 3.2
  • Example 1 (continued)
  • Example 2 (continued)
  • Theorem 4.1
  • Theorem A.1
  • proof
  • Theorem A.1
  • ...and 30 more