AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors

Yucen Wang; Shenghua Wan; Le Gan; Shuai Feng; De-Chuan Zhan

TL;DR

This work targets visual RL under distractors, focusing on homogeneous distractors that visually resemble the controllable agent. It introduces the Implicit-Action Block MDP (IABMDP) and the Implicit Action Generator (IAG) to infer implicit distractor actions, enabling AD3 to train separated world models conditioned on agent actions and implicit distractor actions. AD3 demonstrates superior performance across DeepMind Control Suite tasks with both heterogeneous and homogeneous distractors, and extensive ablations show that the implicit actions play a critical role and carry interpretable semantics. The approach is plug-and-play and can be integrated with other model-based RL backbones, with practical implications for robust visual control in distraction-rich settings.

Abstract

Model-based methods have significantly contributed to distinguishing task-irrelevant distractors for visual control. However, prior research has primarily focused on heterogeneous distractors such as noisy background videos. Homogeneous distractors, which closely resemble the controllable agent, remain largely unexplored and pose significant challenges to existing methods. To tackle this problem, we propose the Implicit Action Generator (IAG) to learn the implicit actions of visual distractors, and present a new algorithm named implicit Action-informed Diverse visual Distractors Distinguisher (AD3), which leverages the actions inferred by IAG to train separated world models. Implicit actions effectively capture the behavior of background distractors, which helps distinguish the task-irrelevant components, so that the agent can optimize its policy within the task-relevant state space. Our method achieves superior performance on various visual control tasks featuring both heterogeneous and homogeneous distractors. The indispensable role of the implicit actions learned by IAG is also empirically validated.

Paper Structure (33 sections, 12 equations, 17 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Methods
Implicit Action Generator
Action-conditioned Separated World Models
Policy Learning
Remarks.
Experiments
Environments and Tasks
How well does AD3 perform in environments with visual inputs that contain complex distractors?
How important are the implicit actions for filtering out task-irrelevant information in visual RL tasks?
What impact do different design choices in AD3 and IAG have on the experimental results?
Do implicit actions learned by the IAG module possess interpretable semantics?
Discussion
...and 18 more sections

Figures (17)

Figure 1: The IABMDP assumption and the architecture of IAG
Figure 2: Performance evaluation of AD3 and baselines over 4 seeds across four visual control tasks, each equipped with two representative distractors: Agent Shifted and Natural Video Backgrounds. The solid curves and the shaded region indicate the average episodic returns and the standard error across different runs, respectively. AD3 is the only method that consistently performs well across all tasks and distractor variants.
Figure 3: Performance and reconstruction results for different semantics of the observation, when using 4 distinct types of distractor actions for learning the task-irrelevant model under the Agent Shifted setting. Each experiment involves two tasks: Cheetah Run and Walker Run. Employing the ground truth action of the distractor achieves effective separation between the primary agent and the shifted distractor, and so does employing the implicit actions learned by IAG, underscoring the efficacy of the implicit actions and their semantic consistency with actual distractor actions. Using the agent action reverses the representation of the two components, and the reconstructed $\hat{o}^+$ contains little task-related information. The "no action" approach tends to preserve most of the information in the task-relevant part, so it fails at the objective of distractor filtering.
Figure 4: Effects of different implicit actions in Cheetah Run + AS (the size of implicit actions is 4). Conditioned on the same initial observation and identical agent action sequences from the original trajectory, we roll out FIAD for 10 steps using 6 different implicit actions of the distractor sampled from the categorical action space. These implicit actions, each represented by 4 one-hot codes with indices indicating active positions in the categorical variables, generate 6 distinct synthetic trajectories in which the shifted agent exhibits different behaviors. This demonstrates that the learned implicit action space is rich in the semantic information of the underlying distractors.
Figure 5: Forward imagination on one trajectory using implicit actions inferred from another. Using FIAD in IAG, we generate a synthetic trajectory with agent actions from Traj-B and implicit distractor actions inferred from Traj-A by TAID. The imagined trajectory exhibits the behavior of the shifted agent in Traj-A and the controllable agent in Traj-B, without incorporating task-relevant semantics from Traj-A into Traj-B. A similar result is observed when we reverse the two trajectories. This illustrates the effective disentanglement of learned implicit actions from original agent actions.
...and 12 more figures
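
The encoding described in the Figure 4 caption (an implicit action made of 4 one-hot codes, with 6 sampled actions each driving a 10-step rollout under identical agent actions) can be illustrated with a minimal sketch. Everything beyond those counts is an assumption for illustration: the number of classes per categorical variable (`NUM_CLASSES`) is not stated in the caption, `toy_forward_model` is a hypothetical scalar stand-in rather than the paper's FIAD, and holding one implicit action fixed over the whole rollout is one possible reading of the caption.

```python
import random

NUM_CATEGORICALS = 4  # "the size of implicit actions is 4" (Figure 4 caption)
NUM_CLASSES = 8       # assumption: classes per categorical are not given in the caption


def sample_implicit_action(rng):
    """Draw one active index per categorical variable."""
    return [rng.randrange(NUM_CLASSES) for _ in range(NUM_CATEGORICALS)]


def to_one_hot(indices):
    """Expand active indices into a flat vector of concatenated one-hot codes."""
    vec = [0] * (NUM_CATEGORICALS * NUM_CLASSES)
    for slot, idx in enumerate(indices):
        vec[slot * NUM_CLASSES + idx] = 1
    return vec


def toy_forward_model(state, agent_action, implicit_action_vec):
    """Hypothetical stand-in for a learned forward model (not the paper's FIAD)."""
    distractor_effect = sum(i * v for i, v in enumerate(implicit_action_vec))
    return state + agent_action + 0.01 * distractor_effect


def rollout(initial_state, agent_actions, implicit_action_vec):
    """Roll the toy model forward, holding the implicit action fixed (assumption)."""
    states = [initial_state]
    for action in agent_actions:
        states.append(toy_forward_model(states[-1], action, implicit_action_vec))
    return states


rng = random.Random(0)
agent_actions = [0.1] * 10  # identical agent action sequence for every rollout
implicit_actions = [sample_implicit_action(rng) for _ in range(6)]
trajectories = [rollout(0.0, agent_actions, to_one_hot(z)) for z in implicit_actions]
```

Because the initial state and agent actions are shared, any difference between the 6 toy trajectories comes only from the implicit action, which mirrors the comparison the caption describes.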
