Intrinsic Dynamics-Driven Generalizable Scene Representations for Vision-Oriented Decision-Making Applications

Dayang Liang; Jinyang Lai; Yunlong Liu

TL;DR

Qualitative analysis validates that DSR, an intrinsic dynamics-driven representation learning method that uses sequence models in visual reinforcement learning, has a superior ability to learn generalizable scene representations on visual tasks.

Abstract

Improving scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually address it by learning task-relevant state representations within visual reinforcement learning. Prior work typically introduces one-step behavioral similarity metrics over elements such as rewards and actions to extract task-relevant state information from observations. However, these methods often ignore the inherent dynamics relationships among the elements, which are essential for learning accurate representations; this in turn impedes the discrimination of task or behavior information that is similar in the short term within long-term dynamics transitions. To alleviate this problem, we propose DSR, an intrinsic dynamics-driven representation learning method with sequence models in visual reinforcement learning. Concretely, DSR optimizes the parameterized encoder with the state-transition dynamics of the underlying system, which prompts the latent encodings to satisfy the state-transition process so that the state space and the noise space can be distinguished. In the implementation, to further improve the ability of DSR to encode similar tasks, the frequency domain of sequential elements and multi-step prediction are adopted to model the inherent dynamics sequentially. Experimental results show that DSR achieves significant performance improvements on the visual Distracting DMControl control tasks, with an average improvement of 78.9% over the backbone baseline. Further results indicate that it also achieves the best performance in real-world autonomous driving applications on the CARLA simulator. Moreover, qualitative analysis validates that our method has a superior ability to learn generalizable scene representations on visual tasks. The source code is available at https://github.com/DMU-XMU/DSR.
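As a minimal illustration of what it means to move a sequential element into the frequency domain, the sketch below transforms a short reward sequence with NumPy and checks the result against a direct evaluation of the discrete-time Fourier transform at evenly spaced frequencies. The function names `sequence_spectrum` and `dtft_direct` and the reward values are hypothetical and chosen only for illustration; how DSR actually builds and predicts these spectra is not shown here.

```python
import numpy as np


def sequence_spectrum(x):
    """Sample the DTFT of a finite sequence at T evenly spaced frequencies.

    x has shape (T,) or (T, D); the transform runs along the time axis.
    """
    x = np.asarray(x, dtype=np.float64)
    return np.fft.fft(x, axis=0)


def dtft_direct(x, omegas):
    """Evaluate X(w) = sum_n x[n] * exp(-j * w * n) at the given frequencies."""
    x = np.asarray(x, dtype=np.float64)
    n = np.arange(x.shape[0])
    kernel = np.exp(-1j * np.outer(omegas, n))  # shape (K, T)
    return kernel @ x


# Hypothetical reward sequence of length T = 8 (illustrative values only).
rewards = np.array([0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5])
T = rewards.shape[0]

# Frequencies 2*pi*k/T, where the DTFT of a length-T sequence equals its DFT.
omegas = 2.0 * np.pi * np.arange(T) / T

spectrum = sequence_spectrum(rewards)
direct = dtft_direct(rewards, omegas)
```

Here `spectrum` and `direct` agree up to floating-point error, which is the standard relationship between the DFT and the DTFT of a finite sequence.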

Paper Structure (16 sections, 1 theorem, 20 equations, 10 figures, 3 tables, 1 algorithm)

This paper contains 16 sections, 1 theorem, 20 equations, 10 figures, 3 tables, 1 algorithm.

Introduction
Preliminaries
Deep Reinforcement Learning
Actor-Critic Framework
Behavior Similarity Metrics
Discrete-Time Fourier Transform
Method
Intrinsic Dynamics
Sequential Optimization in DSR
Prediction Models for Frequency Domain
Forward dynamics
Experiments
Evaluation on the Distracting DMControl suite
Verification on the Autonomous Driving
Related Work
...and 1 more sections

Key Result

Theorem 1

Let the observation and action sequences be denoted by $\mathbf{o}_{t:t+T-1}$ and $\mathbf{a}_{t:t+T-1}$, respectively. Given the approximate posterior $p_\phi$ with encoding parameters $\phi$, the evidence lower bound (ELBO) on the data log-likelihood is:
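For orientation, a generic ELBO for an action-conditioned sequential latent-variable model is sketched below. It is not taken from the paper: the latent states $\mathbf{z}_k$, the decoder $p_\theta(\mathbf{o}_k \mid \mathbf{z}_k)$, the transition prior $p_\theta(\mathbf{z}_k \mid \mathbf{z}_{k-1}, \mathbf{a}_{k-1})$, and the filtering factorization of $p_\phi$ are all assumptions, so the exact terms of Theorem 1 may differ.

```latex
\ln p(\mathbf{o}_{t:t+T-1} \mid \mathbf{a}_{t:t+T-1})
\;\ge\; \sum_{k=t}^{t+T-1} \Big(
  \mathbb{E}_{p_\phi(\mathbf{z}_k \mid \mathbf{o}_{\le k}, \mathbf{a}_{<k})}
    \big[ \ln p_\theta(\mathbf{o}_k \mid \mathbf{z}_k) \big]
  \;-\;
  \mathbb{E}_{p_\phi(\mathbf{z}_{k-1} \mid \mathbf{o}_{\le k-1}, \mathbf{a}_{<k-1})}
    \big[ \mathrm{KL}\big(
      p_\phi(\mathbf{z}_k \mid \mathbf{o}_{\le k}, \mathbf{a}_{<k})
      \,\big\|\,
      p_\theta(\mathbf{z}_k \mid \mathbf{z}_{k-1}, \mathbf{a}_{k-1})
    \big) \big]
\Big)
```

In this generic form, the first term rewards encodings that explain each observation, and the KL term keeps each posterior step close to what the transition model predicts from the previous latent state and action.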

Figures (10)

Figure 1: Task-relevant state representation derived from the dynamics relationships over the underlying state transitions in DRL.
Figure 2: Overview of the DSR framework. The method is divided into four parts (sequence encoding, frequency domain prediction, latent overshooting, and reinforcement learning), which are distinguished by colored connection arrows. The framework treats the encoder $\phi$ as the core training target, and the trained encoder is used for reinforcement learning policy training. In the figure, $z_{\le t+2}$ and $z_{\le t+3}$ are shorthand for the sequential latent encoded states $z_{t:t+2}$ and $z_{t+1:t+3}$, respectively (similarly for other vectors).
Figure 3: Visual task examples for cheetah_run and walker_walk in DMControl. Left: clean DMControl setting with the original background; Right: distracting DMControl setting with random background videos. The cheetah_run task is to control the six-degree-of-freedom cheetah robot to run rapidly; the walker_walk task is to control the six-degree-of-freedom humanoid robot to walk rapidly.
Figure 4: Evaluation curves on the distracting DMControl suite with the unseen video background setting at 500K environment steps. For each method, the curves show the mean and standard deviation of rewards over 3 random seeds. Each experiment completed a total of 50 evaluations, and the checkpoint score for each evaluation was averaged over 10 episodes. The yellow line (DrQ+DSR) is our method.
Figure 5: Training curves of our method (DrQ+DSR) and the DrQ backbone on two seen background videos.
...and 5 more figures
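The evaluation protocol described in the Figure 4 caption (3 random seeds, 50 evaluations per run, 10 episodes per evaluation) can be sketched as a small aggregation routine. The function name `aggregate_eval_curves`, the `(seeds, evaluations, episodes)` array layout, the synthetic returns, and the use of the population standard deviation are assumptions made for illustration; they are not details taken from the paper or its code.

```python
import numpy as np


def aggregate_eval_curves(returns):
    """Turn per-episode returns into a mean curve and a std band over seeds.

    returns: array of shape (num_seeds, num_evaluations, num_episodes).
    """
    returns = np.asarray(returns, dtype=np.float64)
    if returns.ndim != 3:
        raise ValueError("expected shape (seeds, evaluations, episodes)")
    # Checkpoint score: average over the episodes of each evaluation.
    checkpoint_scores = returns.mean(axis=2)  # (seeds, evaluations)
    # Curve and band: mean and (population) std across seeds.
    return checkpoint_scores.mean(axis=0), checkpoint_scores.std(axis=0)


# Synthetic placeholder returns, not experimental data:
# 3 seeds, 50 evaluations, 10 episodes per evaluation.
rng = np.random.default_rng(0)
synthetic_returns = rng.uniform(0.0, 1000.0, size=(3, 50, 10))

mean_curve, std_curve = aggregate_eval_curves(synthetic_returns)
```

Plotting `mean_curve` with a shaded band of one `std_curve` on either side would give a curve of the kind the caption describes, one per method.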

Theorems & Definitions (1)

Theorem 1
