Task and Domain Adaptive Reinforcement Learning for Robot Control

Yu Tang Liu; Nilaksh Singh; Aamir Ahmad

TL;DR

The paper addresses robustness of deep RL for robot control under changing tasks and environments by introducing an adaptive agent that combines task transfer (via an arbiter over specialized primitives) with domain transfer (via a learned environment state extractor trained through Rapid Motor Adaptation). The core method, Arbiter-SF, leverages successor features to enable zero-shot transfer to unseen tasks and a two-stage training scheme to estimate environment state from interactions, all validated on a parallelized blimp control platform built in IsaacGym and demonstrated in real-world flight. Compared with strong baselines, the approach achieves superior transfer performance and sample efficiency, supported by ablation studies and extensive real-world experiments. The work advances multitask, domain-robust RL in robotics and showcases practical viability for real-time control on embedded hardware with a scalable simulation pipeline.
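The arbiter idea can be illustrated with a minimal sketch of action selection over successor features. This is not the authors' implementation: it assumes a small discrete action set and the standard successor-feature value form $Q_i(s,a) = \psi_i(s,a) \cdot w$, and the names `arbiter_select` and `psi` as well as all numbers are hypothetical.

```python
import numpy as np

def arbiter_select(psi, w):
    """Pick an action by comparing all primitives under the current task weight.

    psi: array of shape (n_primitives, n_actions, n_features) holding the
         (hypothetical) successor features psi_i(s, a) at the current state.
    w:   array of shape (n_features,), the task weight.
    Returns (action index, index of the primitive that rated it highest).
    """
    q = psi @ w                               # Q_i(s, a), shape (n_primitives, n_actions)
    best_primitive_per_action = q.argmax(axis=0)
    best_q_per_action = q.max(axis=0)         # max over primitives for each action
    action = int(best_q_per_action.argmax())  # then max over actions
    return action, int(best_primitive_per_action[action])

# Two primitives, two actions, two features (made-up values).
psi = np.array([
    [[1.0, 0.0], [0.2, 0.1]],   # primitive 0
    [[0.0, 0.3], [0.1, 0.9]],   # primitive 1
])
w_seen = np.array([1.0, 0.0])    # a task weight resembling a training task
w_unseen = np.array([0.2, 1.0])  # a new mixture of the same features

print(arbiter_select(psi, w_seen))
print(arbiter_select(psi, w_unseen))
```

Because only `w` changes between the two calls, the same learned successor features can be reused for a new task without retraining, which is the intuition behind zero-shot task transfer in this sketch.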

Abstract

Deep reinforcement learning (DRL) has shown remarkable success in simulation domains, yet its application in designing robot controllers remains limited due to its single-task orientation and insufficient adaptability to environmental changes. To overcome these limitations, we present a novel adaptive agent that leverages transfer learning techniques to dynamically adapt its policy in response to different tasks and environmental conditions. The approach is validated through the blimp control challenge, where multitasking capabilities and environmental adaptability are essential. The agent is trained using a custom, highly parallelized simulator built on IsaacGym. We perform zero-shot transfer to fly the blimp in the real world to solve various tasks. We share our code at https://github.com/robot-perception-group/adaptive_agent.

Paper Structure (27 sections, 15 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Related work
Methodology
Problem Definition
Task Transfer via Successor Feature-based Arbiter
Successor Feature Policy Evaluation and Improvement
Training Scheme
Phase I: Training the SFs and Primitive Network
Phase II: Training the Feature Extractor
Supporting Techniques
Collective Learning
Imitation Learning
Auxiliary Task
Specialized Network Architecture
Experiments and Results
...and 12 more sections

Figures (8)

Figure 1: The arbiter architecture allows our adaptive agent, trained purely in simulation, to achieve zero-shot domain transfer and sim-to-real transfer, controlling the real robot and performing unseen tasks. The arbiter selects an action by observing the current task weight $w$, the state of the robot $s^{rb}$, the goal $s^{goal}$, and the environment $s^{env}$. Our customized blimp simulation based on IsaacGym supports a high degree of parallelization to facilitate multitask learning. Each environment possesses a green and a blue pole corresponding to the positions of the hover and navigation goals, respectively. The agent should navigate the blimp to different goal areas depending on the task specification.
Figure 2: Our proposed methodology: the adaptive agent follows the RMA training procedure. In the first phase, all modules except the feature extractor undergo training, focusing on training the primitives and constructing informative environmental states. In the second phase, the extractor is trained to replicate the encoder's output, ensuring that it captures the environment state. A decoder is utilized in both phases to refine the environmental representation. During training, each primitive may control several blimps simultaneously. The arbiter is introduced during the deployment phase for selecting an action.
Figure 3: Tasks performed in the simulation: Various kinds of tasks can be achieved by mixing components in the task weight. (a) The hover task $w[4]$ can be combined with the yaw control task $w[5]$ to track a desired angle in the hover zone, $w=[0,0,0,0, 1,1,0, 0,0, 0,0]$. We can also add a thrust penalty ($w[10]=0.1$) to prevent overshooting (brown). (b) The hover task $w[4]$ can be combined with velocity control $w[6]$ to track a desired velocity in the hover zone, $w=[0,0,0,0, 1,0,1, 0,0, 0,0]$. We can reduce the velocity weight to place less emphasis on this sub-task (brown), i.e., $w_v=w[6]=0.1$. (c) A higher control performance in the waypoint trigger task $w[2]$ can be achieved by mixing the position $w[0]$ and velocity control $w[7]$ tasks, i.e., $w=[.3,.1,1,0, 0,0,0, .3,0, 0,0]$. Compared with a velocity PID controller (brown), the agent does not blindly follow the velocity commands, which prevents overshooting. (d) The agent can fly the blimp backward, $w=[0,0,0,-1, 0,0,-1, 0,.1, 0,0]$, by reversing the yaw $w[3]$ and heading velocity $w[6]$ tasks. However, backward flight is slow and inefficient: the agent can hardly maintain its altitude and ends up being reset by the simulator.
Figure 4: Experimental results on task transfer: The solid line indicates the mean, and the top and bottom edges of the shaded region are the maximum and minimum rewards of each experiment. Compared to (a), the learning rate in experiments (b) and (c) is reduced tenfold, and the number of episodes is increased from 125 to 200 to ensure training stability for better evaluation of the supporting techniques. Note that these experiments are conducted without domain randomization. Environment factors are set to default values and excluded from the agent's observation. (a) Benchmark on transfer performance in the evaluation task set. The green dotted line indicates the final performance of the agent deployed for real-world experiments. All agents incorporate imitation loss (see the Imitation Learning section). (b) All the techniques, except collective learning, significantly improve the sample efficiency and transfer performance. base: no techniques, imi: imitation learning, fta: fuzzy tiling activation, aux: predictive auxiliary task, col: collective learning. (c) Training the Arbiter-SF with more tasks may improve final performance (115 for $N=10$ and 80 for $N=5$) at the cost of learning time.
Figure 5: The color coding indicates how many times the arbiter chose each primitive to perform the action during an evaluation episode, run every five training episodes.
...and 3 more figures
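
The task-weight mixing described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the index constants follow the caption, but the helper names `task_weight` and `reward`, the linear reward form $r = \phi \cdot w$, and the feature values in `phi` are hypothetical.

```python
import numpy as np

N_FEATURES = 11  # length of the task weight vectors quoted in the caption

# Indices named in the Figure 3 caption.
HOVER, YAW_CONTROL, THRUST_PENALTY = 4, 5, 10

def task_weight(components):
    """Build a task weight vector from {index: value} pairs; other entries stay 0."""
    w = np.zeros(N_FEATURES)
    for index, value in components.items():
        w[index] = value
    return w

def reward(phi, w):
    """Assumed linear reward: dot product of per-step features and task weight."""
    return float(np.dot(phi, w))

# Panel (a): hover combined with yaw control.
w_hover_yaw = task_weight({HOVER: 1.0, YAW_CONTROL: 1.0})
# Panel (a), brown curve: the same task with the thrust penalty component added.
w_with_penalty = task_weight({HOVER: 1.0, YAW_CONTROL: 1.0, THRUST_PENALTY: 0.1})

phi = np.ones(N_FEATURES)  # placeholder features, for illustration only
print(w_hover_yaw.tolist())
print(reward(phi, w_hover_yaw))
```

Specifying a task as a sparse set of weighted components keeps each sub-task separately adjustable, which matches how the caption varies a single entry (for example $w[6]=0.1$) to change emphasis.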
