
Environment as Policy: Learning to Race in Unseen Tracks

Hongze Wang, Jiaxu Xing, Nico Messikommer, Davide Scaramuzza

TL;DR

The paper tackles the problem of generalizing drone-racing policies to unseen and dynamic tracks. It introduces an adaptive environment-shaping framework in which a separate Environment Policy curates training tracks, guided by a ranking-based reward that favors tracks of intermediate difficulty, while the Racing Policy learns time-optimal control through standard RL objectives. By alternating updates between the two policies and training in multiple parallel simulations, the approach yields a single racing policy that generalizes to diverse tracks, including scenarios with moving gates, and outperforms domain-randomization and curriculum baselines. Real-world experiments with precise state estimation achieve a 100% success rate on unseen tracks, highlighting significant gains in robustness and transfer for agile robots in open, dynamic environments.
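The TL;DR does not give the exact form of the ranking-based reward. As an illustration only, the sketch below assumes the environment policy is rewarded per parallel environment according to where that environment's racing score ranks among all environments, with the reward peaking at the median rank. The function name `ranking_reward` and the triangular reward shape are hypothetical, not the paper's formula.

```python
from typing import List


def ranking_reward(scores: List[float]) -> List[float]:
    """Hypothetical ranking-based reward for the environment policy.

    scores: one racing-policy performance value per parallel environment
    (e.g. success rate on that environment's track; higher = easier track).
    Returns a reward in [0, 1] per environment that peaks at the median rank,
    i.e. it favors tracks of intermediate difficulty (an assumed shape).
    """
    n = len(scores)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    # Rank environments by score; sorted() is stable, so ties keep index order.
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0] * n
    for rank, env_idx in enumerate(order):
        ranks[env_idx] = rank
    rewards = []
    for r in ranks:
        normalized = r / (n - 1)  # 0.0 = lowest score (hardest), 1.0 = highest (easiest)
        rewards.append(1.0 - 2.0 * abs(normalized - 0.5))
    return rewards


# Illustrative scores only, not results from the paper.
print(ranking_reward([0.1, 0.5, 0.9]))
```

Using ranks rather than raw scores keeps the reward scale fixed as the racing policy improves, so "intermediate difficulty" is always measured relative to the current batch of tracks.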

Abstract

Reinforcement learning (RL) has achieved outstanding success in complex robot control tasks such as drone racing, where RL agents have outperformed human champions on a known race track. However, these agents fail on unseen track configurations and require complete retraining whenever they are presented with a new track layout. This work aims to develop RL agents that generalize effectively to novel track configurations without retraining. The naive solution of training directly on a diverse set of track layouts can overburden the agent: the increased complexity of the environment impairs its ability to learn to fly, resulting in suboptimal policies. To improve the generalizability of the RL agent, we propose an adaptive environment-shaping framework that dynamically adjusts the training environment based on the agent's performance. We achieve this by using a secondary RL policy to design environments that strike a balance between being challenging and achievable, allowing the agent to adapt and improve progressively. With our adaptive environment shaping, a single racing policy efficiently learns to race on diverse, challenging tracks. Experimental results in both simulation and the real world show that our method enables drones to successfully fly complex, unseen race tracks, outperforming existing environment-shaping techniques. Project page: http://rpg.ifi.uzh.ch/env_as_policy.
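The alternating training scheme can be outlined as a loop skeleton. This is a toy sketch under strong assumptions: both RL learners are replaced with hypothetical stand-ins (`ToyRacer`, which has a single scalar skill, and a threshold rule in `adjust_environments`), and each track's difficulty is collapsed to one number per parallel environment. It shows only the schedule, with the racing policy updated every iteration and each environment adjusted independently every N iterations as described in the Figure 1 caption, not the actual RL updates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyRacer:
    """Hypothetical stand-in for the racing policy: one scalar 'skill' value."""
    skill: float = 0.0

    def success_rate(self, difficulty: float) -> float:
        # Toy model: success falls off linearly once a track is harder than the skill.
        return max(0.0, min(1.0, 1.0 - (difficulty - self.skill)))

    def train_step(self, difficulties: List[float], lr: float = 0.05) -> None:
        # Toy "RL update": learning is largest on tracks solved about half the
        # time (s * (1 - s) peaks at s = 0.5) and zero on trivial or hopeless ones.
        rates = [self.success_rate(d) for d in difficulties]
        self.skill += lr * sum(s * (1.0 - s) for s in rates) / len(rates)


def adjust_environments(difficulties: List[float], racer: ToyRacer,
                        target: float = 0.5, step: float = 0.1) -> List[float]:
    # Hypothetical stand-in for the environment policy: each parallel track is
    # adjusted independently, made harder if the racer succeeds more often than
    # the target and easier if it succeeds less often.
    updated = []
    for d in difficulties:
        s = racer.success_rate(d)
        if s > target:
            d += step
        elif s < target:
            d -= step
        updated.append(min(1.0, max(0.0, d)))  # keep difficulty in [0, 1]
    return updated


def train(num_iters: int = 200, num_envs: int = 8,
          env_update_every: int = 10) -> Tuple[ToyRacer, List[float]]:
    racer = ToyRacer()
    difficulties = [0.05 * i for i in range(num_envs)]  # one track per parallel env
    for it in range(num_iters):
        racer.train_step(difficulties)        # racing policy: every iteration
        if (it + 1) % env_update_every == 0:  # environment policy: every N iterations
            difficulties = adjust_environments(difficulties, racer)
    return racer, difficulties


racer, final_difficulties = train()
print(round(racer.skill, 3), [round(d, 2) for d in final_difficulties])
```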

Paper Structure

This paper contains 11 sections, 3 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the proposed method. Every N iterations, the environment policy (left) takes as input the racing-policy evaluation results and the current environments, and generates actions that adjust the gate layout of each parallel environment independently. The racing policy (right) uses the drone and gate states from these simulation environments to learn time-optimal drone racing strategies with an MLP.
  • Figure 2: Visualization of the drone racing tracks used for the experiments, each characterized by varying levels of complexity. All the tracks maintain a consistent size scale, spanning widths from 8 meters to 16 meters.
  • Figure 3: Ablation study on the progress reward. Because evaluation results fluctuate across iterations, we compare the different coefficients fairly by taking the mean and variance of the success rate and lap time after the model has stabilized.
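The Figure 3 comparison protocol can be expressed as a small helper. This is a sketch that assumes the "stabilized" point is supplied by hand as an evaluation-iteration index, since the caption does not say how it is chosen; `stabilized_stats` and `stable_from` are hypothetical names.

```python
import statistics
from typing import Dict, Sequence


def stabilized_stats(history: Sequence[float], stable_from: int) -> Dict[str, float]:
    """Mean and variance of an evaluation metric after training has stabilized.

    history: one value per evaluation iteration (e.g. success rate or lap time).
    stable_from: index of the first evaluation treated as stabilized (assumed
    to be chosen by hand; the selection rule is not given in the caption).
    """
    window = list(history[stable_from:])
    if not window:
        raise ValueError("no evaluations at or after stable_from")
    return {
        "mean": statistics.mean(window),
        # Population variance over the window; statistics.variance would give
        # the sample (n - 1) estimator instead.
        "variance": statistics.pvariance(window),
    }


success = [0.2, 0.5, 0.9, 0.95, 0.85, 0.9]  # illustrative values, not paper results
print(stabilized_stats(success, stable_from=2))
```

The same helper applies unchanged to a lap-time history, so each progress-reward coefficient can be summarized by two (mean, variance) pairs.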