Learning Acrobatic Flight from Preferences

Colin Merk, Ismail Geles, Jiaxu Xing, Angel Romero, Giorgia Ramponi, Davide Scaramuzza

TL;DR

Reward Ensemble under Confidence (REC) is a probabilistic reward learning framework for PbRL that explicitly models per-timestep reward uncertainty through an ensemble of distributional reward models. It is evaluated on acrobatic quadrotor control and further validated on a continuous control benchmark, confirming its applicability beyond the domain of aerial robotics.

Abstract

Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. However, manually designed reward functions for such tasks often fail to capture the qualities that matter: we find that hand-crafted rewards agree with human judgment only 60.7% of the time, underscoring the need for preference-driven approaches. In this work, we propose Reward Ensemble under Confidence (REC), a probabilistic reward learning framework for PbRL that explicitly models per-timestep reward uncertainty through an ensemble of distributional reward models. By propagating uncertainty into the preference loss and leveraging disagreement for exploration, REC achieves 88.4% of shaped reward performance on acrobatic quadrotor control, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them zero-shot to the real world, demonstrating complex acrobatic maneuvers learned purely from preference feedback. We further validate REC on a continuous control benchmark, confirming its applicability beyond the domain of aerial robotics.
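The abstract does not give the exact form of the loss, so the following is only a minimal sketch of how per-timestep reward uncertainty could enter a preference loss and how ensemble disagreement could be measured. It is not the authors' implementation. It assumes that each ensemble member predicts a Gaussian reward (mean and variance) per timestep, that timesteps are independent, and that preferences follow a Bradley-Terry (logistic) model whose expectation under Gaussian uncertainty is taken with the standard probit approximation. All function names are hypothetical.

```python
import math

def ensemble_step_stats(member_means, member_vars):
    """Combine ensemble predictions at one timestep (uniform Gaussian mixture).

    Returns (mean, total_variance, disagreement). Disagreement is the variance
    of the member means; an exploration bonus could be built from it (assumption).
    """
    k = len(member_means)
    mu = sum(member_means) / k
    within = sum(member_vars) / k                       # average predicted variance
    disagreement = sum((m - mu) ** 2 for m in member_means) / k
    return mu, within + disagreement, disagreement

def segment_return_stats(step_means, step_vars):
    """Sum per-timestep means and variances over a segment (independence assumed)."""
    return sum(step_means), sum(step_vars)

def preference_prob(mu1, var1, mu2, var2):
    """Approximate P(segment 1 preferred) when both returns are Gaussian.

    Uses E[sigmoid(x)] ~= sigmoid(m / sqrt(1 + pi * s^2 / 8)) for x ~ N(m, s^2),
    so higher uncertainty pulls the probability toward 0.5.
    """
    scale = math.sqrt(1.0 + math.pi / 8.0 * (var1 + var2))
    return 1.0 / (1.0 + math.exp(-(mu1 - mu2) / scale))

def preference_loss(label, prob, eps=1e-12):
    """Binary cross-entropy; label = 1.0 means segment 1 was preferred."""
    return -(label * math.log(prob + eps) + (1.0 - label) * math.log(1.0 - prob + eps))

# Usage: two short segments, each timestep scored by a 2-member ensemble.
seg1 = [ensemble_step_stats([1.0, 3.0], [0.5, 0.5]),
        ensemble_step_stats([2.0, 2.0], [0.1, 0.1])]
seg2 = [ensemble_step_stats([0.0, 0.0], [0.2, 0.2]),
        ensemble_step_stats([1.0, 1.0], [0.2, 0.2])]
mu1, var1 = segment_return_stats([s[0] for s in seg1], [s[1] for s in seg1])
mu2, var2 = segment_return_stats([s[0] for s in seg2], [s[1] for s in seg2])
loss = preference_loss(1.0, preference_prob(mu1, var1, mu2, var2))
```

With this formulation, a confidently wrong prediction is penalized more than an uncertain one, because the variance term flattens the predicted preference probability toward 0.5.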

Paper Structure

This paper contains 30 sections, 22 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the proposed approach. (a) In simulation, trajectory pairs $(\tau_1, \tau_2)$ are presented to an annotator (human or synthetic) to collect preference labels. These labels train an ensemble of reward models with uncertainty estimation, which provides the reward signal for policy optimization via reinforcement learning. (b) The resulting policy is transferred zero-shot to a real quadrotor to execute acrobatic maneuvers.
  • Figure 2: Average reward during training on the walker-walk task with 1000 synthetic preferences. Shaded areas denote standard deviation across 3 seeds. PPO with the shaped environment reward serves as the upper baseline. Ablation components are introduced incrementally to obtain REC Preference PPO.
  • Figure 3: Stop-motion visualizations of the highest-reward evaluation rollout in simulation for each training configuration. (a) PPO with shaped rewards. (b) Preference PPO with 1000 synthetic preferences. (c) REC Preference PPO with 1000 synthetic preferences. (d) REC Preference PPO with 1000 human-labeled preferences.
  • Figure 4: Long-exposure photographs of real-world deployment across four training configurations. (a) PPO with shaped rewards. (b) Preference PPO with 1000 synthetic preferences. (c) REC Preference PPO with 1000 synthetic preferences. (d) REC Preference PPO with 1000 human-labeled preferences on the powerloop task.
  • Figure 5: Mean evaluation reward over 3 seeds during training of the continuous powerloop with 1000 synthetic preferences. Shaded areas indicate standard deviation.
  • ...and 1 more figures