
Domain Adversarial Reinforcement Learning

Bonnie Li, Vincent François-Lavet, Thang Doan, Joelle Pineau

TL;DR

Domain Adversarial Reinforcement Learning (DARL) tackles generalization in visual RL by learning domain-invariant latent representations. It combines Soft Actor-Critic (SAC) with a domain-adversarial loss, implemented via a gradient reversal layer, that aligns features across background domains. On DeepMind Control tasks, DARL substantially improves zero-shot generalization to unseen and non-stationary visual contexts, and its latent spaces remain task-relevant. The results indicate that enforcing domain invariance in the representation can significantly improve the robustness of pixel-based RL policies to the kind of visual variability found in real-world settings.

Abstract

We consider the problem of generalization in reinforcement learning where visual aspects of the observations might differ, e.g. when there are different backgrounds or changes in contrast, brightness, etc. We assume that our agent has access to only a few of the MDPs from the MDP distribution during training. The performance of the agent is then reported on new, unknown test domains drawn from the same distribution (e.g. unseen backgrounds). For this "zero-shot RL" task, we enforce invariance of the learned representations to visual domains via a domain adversarial optimization process. We empirically show that this approach yields a significant improvement in generalization to new, unseen domains.
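The domain adversarial optimization can be summarized by the parameter updates below. This is a sketch in the usual gradient-reversal form, and the notation is ours rather than the paper's: φ, θ and ψ are the parameters of the encoder, the critic and the domain classifier, L_Q is the SAC critic loss, L_D is the domain classification loss, η is a learning rate, and λ > 0 is the magnitude of the negative constant applied by the gradient reversal layer.

```latex
\begin{aligned}
\psi   &\leftarrow \psi   - \eta\, \nabla_{\psi}\, \mathcal{L}_D(\phi, \psi) \\
\theta &\leftarrow \theta - \eta\, \nabla_{\theta}\, \mathcal{L}_Q(\phi, \theta) \\
\phi   &\leftarrow \phi   - \eta \left( \nabla_{\phi}\, \mathcal{L}_Q(\phi, \theta) - \lambda\, \nabla_{\phi}\, \mathcal{L}_D(\phi, \psi) \right)
\end{aligned}
```

The classifier descends on L_D, while the reversed gradient makes the encoder ascend on it. Together these updates seek a saddle point of L_Q(φ, θ) − λ L_D(φ, ψ), minimized over (φ, θ) and maximized over ψ.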

Paper Structure

This paper contains 19 sections, 12 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Training and evaluation setup: the agent is trained on a distribution of MDPs with different visual backgrounds, and evaluation is done in new domains with unknown backgrounds. To improve generalization to new MDPs from the distribution, our domain adversarial approach is trained to focus on the visual aspects that matter for the task and to ignore irrelevant factors.
  • Figure 2: The proposed DARL architecture adds a domain classification module (red) alongside the encoder and critic of SAC. Feature distributions across domains are aligned via a gradient reversal layer placed between the encoder and the domain classifier; during backpropagation, this layer multiplies the gradient by a negative constant. Otherwise, training follows the standard procedure and minimizes the critic loss and the domain classification loss. Because of the gradient reversal, the encoder is pushed to make features from different domains indistinguishable to the domain classifier, which encourages domain-invariant features.
  • Figure 3: Left four: training environments for the RL agent on DeepMind Control tasks. Right two: testing environments in which we evaluate the RL agent. Both sets of environments use stationary image backgrounds; the backgrounds differ only in the shapes of their lines and in their colors.
  • Figure 4: DeepMind Control tasks. Solid and dashed lines indicate training and testing environments, respectively. DARL outperforms SAC on the testing environments across all four tasks. Results are averaged over 3 seeds; lines show the mean and shaded areas show the standard error.
  • Figure 5: t-SNE of the latent spaces learned by SAC (top) and by DARL (bottom). Green shows state representations from training environments; red shows those from unseen testing environments. While SAC learns background-dependent features, DARL removes these task-irrelevant features and learns domain-invariant representations.
  • ...and 5 more figures
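Figure 2 describes the gradient reversal layer only as "multiply the gradient by a negative constant". The NumPy sketch below makes that concrete under simplifying assumptions that are ours, not the paper's. It uses a linear encoder and a logistic classifier over two domains. The critic gradient `grad_h_critic` is passed in from outside rather than computed by SAC. The names `grl_forward`, `grl_backward`, `domain_loss`, `darl_grads` and `lam` are hypothetical. The backward pass is derived by hand, so one can check two things: the classifier gradient is not reversed, and the encoder receives the critic gradient minus `lam` times the domain-loss gradient.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grl_forward(h):
    # Gradient reversal layer, forward pass: the identity (features pass through unchanged).
    return h


def grl_backward(grad_h, lam):
    # Gradient reversal layer, backward pass: scale the incoming gradient by -lam.
    return -lam * grad_h


def domain_loss(X, y, W_enc, w_cls, b_cls):
    # Hypothetical model: linear encoder h = X W_enc^T, then a logistic domain classifier.
    # Returns the mean binary cross-entropy, written stably as log(1 + e^z) - y * z.
    h = grl_forward(X @ W_enc.T)
    z = h @ w_cls + b_cls
    return np.mean(np.logaddexp(0.0, z) - y * z)


def darl_grads(X, y, W_enc, w_cls, b_cls, grad_h_critic, lam):
    """Hand-derived gradients for one update of the toy model above.

    grad_h_critic stands in for dL_Q/dh, the SAC critic-loss gradient with respect
    to the encoder features; here it is simply passed in (an assumption).
    Returns (dW_enc, dw_cls, db_cls).
    """
    n = X.shape[0]
    h = grl_forward(X @ W_enc.T)            # (n, k) encoder features
    p = sigmoid(h @ w_cls + b_cls)          # (n,) predicted probability of domain 1
    dz = (p - y) / n                        # dL_D / dlogits for the mean BCE
    dw_cls = h.T @ dz                       # classifier gradients are NOT reversed
    db_cls = dz.sum()
    grad_h_domain = np.outer(dz, w_cls)     # dL_D / dh as computed above the GRL
    grad_h = grad_h_critic + grl_backward(grad_h_domain, lam)
    dW_enc = grad_h.T @ X                   # = dL_Q/dW_enc - lam * dL_D/dW_enc
    return dW_enc, dw_cls, db_cls
```

A plain gradient step would then be `W_enc -= lr * dW_enc`, `w_cls -= lr * dw_cls` and `b_cls -= lr * db_cls`. The classifier gets better at separating the domains, while the encoder moves its features in the direction that makes that separation harder. In an autodiff framework, the same effect is normally obtained by registering a custom backward function for the reversal layer rather than by writing gradients by hand.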