Efficient Real-World Autonomous Racing via Attenuated Residual Policy Optimization

Raphael Trumpp, Denis Hoornaert, Mirco Theile, Marco Caccamo

Abstract

Residual policy learning (RPL), in which a learned policy refines a static base policy using deep reinforcement learning (DRL), has shown strong performance across various robotic applications. Its effectiveness is particularly evident in autonomous racing, a domain that serves as a challenging benchmark for real-world DRL. However, deploying RPL-based controllers introduces system complexity and increases inference latency. We address this by introducing an extension of RPL named attenuated residual policy optimization ($\alpha$-RPO). Unlike standard RPL, $\alpha$-RPO yields a standalone neural policy by progressively attenuating the base policy, which initially serves to bootstrap learning. Furthermore, this mechanism enables a form of privileged learning, where the base policy is permitted to use sensor modalities not required for final deployment. We design $\alpha$-RPO to integrate seamlessly with PPO, ensuring that the attenuated influence of the base controller is dynamically compensated during policy optimization. We evaluate $\alpha$-RPO by building a framework for 1:10-scaled autonomous racing around it. In both simulation and zero-shot real-world transfer to Roboracer cars, $\alpha$-RPO not only reduces system complexity but also improves driving performance compared to baselines, demonstrating its practicality for robotic deployment. Our code is available at: https://github.com/raphajaner/arpo_racing.
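
The attenuation idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper or its repository: the additive combination `alpha * base + residual`, the linear decay, the action bounds, and the names `alpha_schedule` and `combine_actions` are all assumptions chosen for illustration, and the authors' actual formulation (including how PPO compensates for the changing base contribution) may differ.

```python
import numpy as np


def alpha_schedule(step: int, decay_steps: int) -> float:
    """Hypothetical schedule: decay alpha linearly from 1.0 to 0.0."""
    if decay_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / decay_steps)


def combine_actions(base_action, residual_action, alpha, low=-1.0, high=1.0):
    """Assumed blending rule: attenuated base action plus learned residual.

    With alpha = 1 this is a plain residual policy; with alpha = 0 the
    learned policy acts alone and the base controller is no longer needed.
    """
    base_action = np.asarray(base_action, dtype=float)
    residual_action = np.asarray(residual_action, dtype=float)
    # Clip to assumed normalized action bounds [low, high].
    return np.clip(alpha * base_action + residual_action, low, high)


if __name__ == "__main__":
    base = [0.4, 0.8]       # e.g. output of a classical base controller
    residual = [0.1, -0.2]  # e.g. output of the learned policy
    for step in (0, 500, 1000):
        alpha = alpha_schedule(step, decay_steps=1000)
        print(step, alpha, combine_actions(base, residual, alpha))
```

Under these assumptions, once `alpha` reaches zero the executed action depends only on the learned policy, which is what would allow the base controller, and any sensors used only by it, to be dropped at deployment.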

Paper Structure (50 sections, 12 equations, 14 figures, 8 tables, 1 algorithm)

Figures (14)

  • Figure 1: We test our proposed $\alpha$-RPO method by learning competitive real-world racing behavior with 1:10-scaled autonomous Roboracer cars. Compared to classical RPL, $\alpha$-RPO attenuates the contribution of the base policy during training, improving final performance while yielding a standalone neural policy at inference time for efficient deployment.
  • Figure 2: Learning curves with the episodic values of the return during training, showing both the fraction of the return corresponding only to the lap progress (left) and the total sum (right). Additionally, the cumulative number of crashes during training is shown (mid). The agents are trained for 2.5M interaction steps on 15 different maps.
  • Figure 3: Learning curves showing the progress-return for our ablation studies. We ablate the synchronization trick (left), longer $\alpha$-schedules (mid), and design aspects of the DNN (right).
  • Figure 4: Qualitative comparison of trajectories (top) and speed profiles (bottom) on three racetracks, showing a single flying lap. Markers indicate specific points of interest (POIs) on the racetracks: $+$ marks a high-speed section (red), $-$ indicates the section with minimal speed (blue), while $\mathrm F$ is the finish line (yellow).
  • Figure 5: Real-world trajectories of $\alpha$-RPO on the Munich racetrack for 5 laps (top), including a variation with two obstacles placed; the current speed is shown by the color bar. Positions are estimated offline from rosbag data using SLAM. The agent's action commands (cmd) and the measured speed (meas.) are also shown (bottom). Markers indicate specific points of interest (POIs) on the racetracks: $+$ marks a high-speed section (red), $-$ indicates the section with minimal speed (blue), while $\mathrm F$ is the finish line (yellow).
  • ...and 9 more figures