Adaptive Exploration for Data-Efficient General Value Function Evaluations

Arushi Jain, Josiah P. Hanna, Doina Precup

TL;DR

This paper introduces GVFExplorer, a method that adaptively learns a single behavior policy to efficiently collect data for evaluating multiple GVFs in parallel, and proves that each behavior policy update decreases the overall mean squared error in GVF predictions.

Abstract

General Value Functions (GVFs) (Sutton et al., 2011) represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique reward. Existing methods relying on fixed behavior policies or pre-collected data often face data efficiency issues when learning multiple GVFs in parallel using off-policy methods. To address this, we introduce GVFExplorer, which adaptively learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel. Our method optimizes the behavior policy by minimizing the total variance in return across GVFs, thereby reducing the required environmental interactions. We use an existing temporal-difference-style variance estimator to approximate the return variance. We prove that each behavior policy update decreases the overall mean squared error in GVF predictions. We empirically show our method's performance in tabular and nonlinear function approximation settings, including Mujoco environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs.
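
To make the idea of choosing a behavior policy that minimizes total variance across GVFs concrete, the following is a minimal sketch in a deliberately simplified setting: a single state, one-step returns, and deterministic cumulants. It is not the paper's algorithm, which handles multi-step returns and uses a temporal-difference-style variance estimator. The function names, the example policies, and the cumulant values are all hypothetical. The closed-form minimizer used here, $\mu(a) \propto \sqrt{\sum_i \pi_i(a)^2 c_i(a)^2}$, is the standard variance-optimal importance sampling result for this one-step case and is an assumption of the sketch, not a quotation from the paper.

```python
import math


def total_is_variance(mu, target_policies, cumulants):
    """Sum over GVFs of Var_{a~mu}[rho_i(a) * c_i(a)], with rho_i = pi_i / mu.

    One-step, single-state setting with deterministic cumulants c_i(a).
    """
    total = 0.0
    for pi, c in zip(target_policies, cumulants):
        # True value of this GVF: E_{a~pi}[c(a)].
        value = sum(p * r for p, r in zip(pi, c))
        # Second moment under mu: sum_a pi(a)^2 c(a)^2 / mu(a).
        second_moment = sum(
            (p * r) ** 2 / m for p, r, m in zip(pi, c, mu) if p * r != 0.0
        )
        total += second_moment - value ** 2
    return total


def variance_minimizing_behavior(target_policies, cumulants):
    """Behavior policy with mu(a) proportional to sqrt(sum_i pi_i(a)^2 c_i(a)^2)."""
    n_actions = len(target_policies[0])
    scores = [
        math.sqrt(sum((pi[a] * c[a]) ** 2 for pi, c in zip(target_policies, cumulants)))
        for a in range(n_actions)
    ]
    normalizer = sum(scores)
    return [s / normalizer for s in scores]


# Hypothetical example: two GVFs (policy, cumulant) over three actions.
policies = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
cumulants = [[1.0, 0.5, 0.0], [0.0, 0.5, 2.0]]

uniform = [1.0 / 3.0] * 3
mu_star = variance_minimizing_behavior(policies, cumulants)

print("uniform behavior :", total_is_variance(uniform, policies, cumulants))
print("variance-minimizing:", total_is_variance(mu_star, policies, cumulants))
```

In this toy case the variance-minimizing behavior policy concentrates on actions that are both likely under some target policy and carry large cumulants, which gives a lower summed variance than uniform sampling or than following either target policy alone.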

Paper Structure

This paper contains 43 sections, 8 theorems, 35 equations, 17 figures, 3 tables, 2 algorithms.

Key Result

Theorem 4.1

(Behavior Policy Update:) Given $\mathcal{N}$ target policies $\pi_i$ for $i \in\{1 \dots \mathcal{N} \}$, let $k \in \{1, \dots, K\}$ denote the number of updates to the behavior policy $\mu$ and let $\rho_i(s,a) = \dfrac{\pi_i(a|s)}{\mu(a|s)}$ be the per-step IS weight. Using the variance state-ac
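
The per-step IS weight $\rho_i(s,a)$ defined in the theorem statement can be illustrated with a short sketch. It assumes tabular policies stored as nested lists indexed as `policy[state][action]`; that representation, the function name, and the example numbers are hypothetical choices made here for illustration.

```python
def per_step_is_weights(target_policies, mu, state, action):
    """Return rho_i(s, a) = pi_i(a|s) / mu(a|s) for every target policy pi_i."""
    behavior_prob = mu[state][action]
    if behavior_prob <= 0.0:
        # The weight is undefined when mu never takes the sampled action.
        raise ValueError("mu must give positive probability to the sampled action")
    return [pi[state][action] / behavior_prob for pi in target_policies]


# Hypothetical example: one state, two actions, two target policies.
target_policies = [[[0.9, 0.1]], [[0.2, 0.8]]]
mu = [[0.5, 0.5]]

weights = per_step_is_weights(target_policies, mu, state=0, action=1)
print(weights)
```

Because a single action sampled from $\mu$ yields one weight per target policy, the same transition can be reused to update all $\mathcal{N}$ GVFs at once.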

Figures (17)

  • Figure 1: MSE Performance: Averaged MSE over $25$ runs with standard error in different experimental settings. GVFExplorer demonstrates notably lower MSE than the baselines.
  • Figure 2: Two Distinct Policies & Distinct Cumulants: Averaged MSE over $25$ runs with two distinct distractor GVFs $(\pi_1,c_1),(\pi_2,c_2)$ in a gridworld. Green dots at the top show the two GVF goals. (a) Averaged MSE; averaged absolute error in GVF value predictions for (b) the RoundRobin baseline and (c) GVFExplorer. The color bar uses a log scale, and vibrant colors indicate higher values.
  • Figure 3: Non-Linear Function Approximation: (a) Averaged MSE over $50$ runs with standard error using an Experience Replay Buffer (solid lines) and PER (dotted lines). GVFExplorer shows lower MSE with both buffers. PER generally reduces MSE across all algorithms except SR. Log-scale absolute value error for (b) RoundRobin and (c) GVFExplorer; GVFExplorer achieves smaller errors (vibrant colors represent higher values).
  • Figure 4: MSE in Mujoco: Averaged MSE over $5$ runs with standard error in Mujoco environments with continuous state-action spaces, for the (a) Walker and (b) Cheetah domains, comparing GVFExplorer, UniformPolicy, and RoundRobin. GVFExplorer consistently achieves lower averaged MSE than the baselines.
  • Figure 5: Visual representation of cumulants.
  • ...and 12 more figures

Theorems & Definitions (17)

  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof
  • Lemma 5.0
  • proof
  • Theorem A.1
  • proof
  • Theorem A.1
  • proof
  • ...and 7 more