EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Shengjie Wang; Shaohuai Liu; Weirui Ye; Jiacheng You; Yang Gao

TL;DR

This paper introduces EfficientZero V2, a general framework for sample-efficient RL that extends EfficientZero to multiple domains, covering both continuous and discrete actions as well as visual and low-dimensional inputs.

Abstract

Sample efficiency remains a crucial challenge in applying Reinforcement Learning (RL) to real-world tasks. While recent algorithms have made significant strides in improving sample efficiency, none have achieved consistently superior performance across diverse domains. In this paper, we introduce EfficientZero V2, a general framework designed for sample-efficient RL algorithms. We have expanded the performance of EfficientZero to multiple domains, encompassing both continuous and discrete actions, as well as visual and low-dimensional inputs. With a series of improvements we propose, EfficientZero V2 outperforms the current state-of-the-art (SOTA) by a significant margin in diverse tasks under the limited data setting. EfficientZero V2 exhibits a notable advancement over the prevailing general algorithm, DreamerV3, achieving superior outcomes in 50 of 66 evaluated tasks across diverse benchmarks, such as Atari 100k, Proprio Control, and Vision Control.

Paper Structure (43 sections, 2 theorems, 26 equations, 8 figures, 7 tables)

Introduction
Related work
Sample Efficient RL
MCTS-based Work
Preliminary
Reinforcement Learning
Gumbel-Top-k Trick
EfficientZero
Network Structure
Training Process
Method
Overview
Policy Learning with Tree Search
Target Policy from Tree Search
Learning using Target Policy
...and 28 more sections

Key Result

Corollary 4.3

Define $s_t, a_t, r_t$ to be the states, actions, and rewards resulting from the current policy $\pi$ using the true dynamics $\mathcal{G}^*$ and reward function $\mathcal{R}^*$, starting from $s_0 \sim \nu$, and similarly define $\hat{s}_t, \hat{a}_t, \hat{r}_t$ using the learned function $\mathcal{G}$. Let reward [...] within a tree-search process. Then we have errors [...], where $N$ is the number of simulations in the search.

Figures (8)

Figure 1: Comparison between EfficientZero V2, DreamerV3, and other SOTA methods in each domain. We evaluate them on the Atari 100k, DMControl Proprio, and DMControl Vision benchmarks. We then set the performance of the previous SOTA to 1, which yields normalized mean scores for both EfficientZero V2 and DreamerV3. EfficientZero V2 surpasses or closely matches the previous SOTA in each domain.
Figure 2: Framework of EZ-V2. (A) How EZ-V2 trains its model. The representation $\mathcal{H}$ takes observations as inputs and outputs the state. The dynamic model $\mathcal{G}$ predicts the next state and reward based on the current state and action. Sampling-based Gumbel search outputs the target policy $\pi_t$ and target value $z_t$. (B) How the sampling-based Gumbel search uses the model to plan. The process contains action sampling and selection. The iterative action selection outputs the recommended action $a^*_S$, the search-based value target (target value), and the improved policy (target policy).
Figure 3: Ablation study of our search method, namely the sampling-based Gumbel search (S-Gumbel search). We compare it across different numbers of simulations (n=16, 8) and against Sample MCTS (Hubert et al., 2021). Our method outperforms Sample MCTS, and increasing the number of simulations improves our method's performance on hard tasks.
Figure 4: Ablation study of our value target, known as the mixed value target. We compare it with different value targets, including the multi-step TD target and the double Q-value target. The mixed value target consistently achieves high performance in both Proprio Control and Vision Control tasks.
Figure 5: Intuitive example showing the difference between simple policy loss using $a^*_S$ and cross-entropy loss.
...and 3 more figures
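
The Figure 2 caption mentions an action sampling step, and the section list names the Gumbel-Top-k trick. As a rough illustration only, the following is a minimal Python sketch of the generic Gumbel-Top-k trick: each logit is perturbed with independent Gumbel(0, 1) noise and the k largest perturbed values are kept, which samples k distinct indices without replacement. The function name `gumbel_top_k`, its parameters, and the example logits are assumptions made for this sketch. It is not the authors' implementation and does not reproduce the full sampling-based Gumbel search.

```python
import math
import random


def gumbel_top_k(logits, k, rng=None):
    """Sample k distinct indices without replacement (Gumbel-Top-k trick).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    rng = rng or random.Random()
    perturbed = []
    for index, logit in enumerate(logits):
        # rng.random() lies in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel, index))
    # Keep the indices of the k largest perturbed logits.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


# Example: pick 2 of 4 candidate actions from assumed policy logits.
logits = [2.0, 0.5, 0.1, -1.0]
actions = gumbel_top_k(logits, k=2, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the draw reproducible. Indices with larger logits are more likely to be selected, but every index has nonzero probability.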

Theorems & Definitions (6)

Definition 4.1: Policy Improvement
Definition 4.2: Search-Based Value Estimation
Corollary 4.3: Search-Based Value Estimation Error
Definition 4.1: Search-Based Value Estimation
Corollary 5.1: Search-Based Value Estimation Error
proof
