Understanding What Affects the Generalization Gap in Visual Reinforcement Learning: Theory and Empirical Evidence

Jiafei Lyu, Le Wan, Xiu Li, Zongqing Lu

TL;DR

Our theory indicates that minimizing the representation distance between the training and testing environments is the most critical factor in reducing the generalization gap. This is supported by empirical evidence on the DMControl Generalization Benchmark.

Abstract

Recently, there have been many efforts to learn useful policies for continuous control in visual reinforcement learning (RL). In this setting, it is important to learn a generalizable policy, as the testing environment may differ from the training environment, e.g., distractors may be present during deployment. Many practical algorithms have been proposed to handle this problem. However, to the best of our knowledge, none of them provides a theoretical understanding of what affects the generalization gap and why the proposed methods work. In this paper, we address this gap by theoretically identifying the key factors that contribute to the generalization gap when the testing environment has distractors. Our theory indicates that minimizing the representation distance between the training and testing environments, which aligns with human intuition, is the most critical factor in reducing the generalization gap. Our theoretical results are supported by empirical evidence on the DMControl Generalization Benchmark (DMC-GB).

Paper Structure (14 sections, 15 theorems, 65 equations, 11 figures, 1 table, 1 algorithm)


Key Result

Lemma 1

Assume that Assumption ass:policy holds, and denote $\phi(\cdot)$ as the encoder. Then under the transition $\mathcal{T}(s,a,\xi)$, at step $t\in\{0,1,\ldots,T\}$ in an episode of length $T+1$, we have

Figures (11)

  • Figure 1: Comparison of state transition before (left) and after (right) reparameterization.
  • Figure 2: Evidence that the Lipschitz condition of the policy holds before and after adding distractors. We present a scatter plot of the policy deviation against the representation deviation of DrQ and PIE-G on the walker-walk video-easy task with and without distractors (i.e., the transpose function $f(\cdot)$). The solid line denotes the maximum slope in the batch, i.e., $y=kx$ where $k = \max\frac{\|\pi(\phi(s)) - \pi(\phi(s^\prime))\|}{\|\phi(s) - \phi(s^\prime)\|}$ for the training environment, and $k=\max\frac{\|\pi(\phi(f(s))) - \pi(\phi(f(s^\prime)))\|}{\|\phi(f(s)) - \phi(f(s^\prime))\|}$ for the testing environment. Since there always exists a $k$ such that the policy deviation of all samples can be bounded, the Lipschitz condition for the policy holds naturally.
  • Figure 3: Evidence that our theoretical results can explain empirical algorithms. We present a comparison of the average representation deviation ($\mathbb{E}[\|\phi(s) - \phi(s^\prime)\|_2^2]$, left column) and the average policy deviation ($\mathbb{E}[\|\pi(\phi(s)) - \pi(\phi(s^\prime))\|_2^2]$, right column) of 6 typical methods on the color-hard, video-easy, and video-hard settings of the walker-walk task from DMC-GB. The results are averaged over the trajectory and across 5 different random seeds. The error bar denotes the average standard deviation along the trajectory and across the 5 seeds.
  • Figure 4: Evidence that our theoretical results can explain empirical algorithms. We present a comparison of the average representation deviation ($\mathbb{E}[\|\phi(s) - \phi(s^\prime)\|_2^2]$, left column) and the average policy deviation ($\mathbb{E}[\|\pi(\phi(s)) - \pi(\phi(s^\prime))\|_2^2]$, right column) of 6 typical methods on the color-hard, video-easy, and video-hard settings of the finger-spin task from DMC-GB. The results are averaged over the trajectory and across 5 different random seeds. The error bar denotes the average standard deviation along the trajectory and across the 5 seeds.
  • Figure 5: Example trajectories of the training environment (first row), and of DrQ (second row), SVEA (third row), and PIE-G (fourth row) deployed in the walker-walk video-easy task with distractors. Results are obtained using the model of each algorithm after training for 500K environment steps.
  • ...and 6 more figures
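
The quantities named in the captions of Figures 2 to 4 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `max_slope` and `mean_sq_deviation` are hypothetical, and the inputs are assumed to be precomputed representations $\phi(s)$ and policy outputs $\pi(\phi(s))$ stored as plain lists of floats. Which pairs $(s, s^\prime)$ are compared is also an assumption; here `max_slope` uses all pairs within one batch, and `mean_sq_deviation` takes two aligned lists.

```python
import math
from itertools import combinations


def max_slope(reps, acts, eps=1e-12):
    """Largest ratio ||a_i - a_j|| / ||z_i - z_j|| over all pairs in a batch.

    reps: list of representation vectors z_i = phi(s_i)   (assumed input)
    acts: list of policy outputs   a_i = pi(phi(s_i))     (assumed input)
    """
    k = 0.0
    for i, j in combinations(range(len(reps)), 2):
        dz = math.dist(reps[i], reps[j])
        if dz > eps:  # skip pairs with (near-)identical representations
            k = max(k, math.dist(acts[i], acts[j]) / dz)
    return k


def mean_sq_deviation(xs, ys):
    """Average squared L2 distance between aligned vectors xs[i] and ys[i]."""
    return sum(math.dist(x, y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy batch: three 2-D representations and their 1-D policy outputs.
reps = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
acts = [[0.0], [1.0], [4.0]]

k = max_slope(reps, acts)                 # slope of the line y = k x
rep_dev = mean_sq_deviation(reps[:2], reps[1:])
```

By construction, every pair in the batch satisfies $\|\pi(\phi(s)) - \pi(\phi(s^\prime))\| \le k\,\|\phi(s) - \phi(s^\prime)\|$ for the returned `k`, which is the sense in which the solid line in Figure 2 bounds all samples.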

Theorems & Definitions (30)

  • Lemma 1: Policy deviation
  • proof
  • Lemma 2: State deviation
  • proof
  • Lemma 3: Reward deviation
  • proof
  • Theorem 1: Fixed policy shift error
  • proof
  • Corollary 1
  • proof
  • ...and 20 more