A Champion-level Vision-based Reinforcement Learning Agent for Competitive Racing in Gran Turismo 7

Hojoon Lee, Takuma Seno, Jun Jet Tai, Kaushik Subramanian, Kenta Kawamoto, Peter Stone, Peter R. Wurman

TL;DR

This work addresses the gap in real-world applicability of deep RL for autonomous racing by presenting a vision-based agent that operates using ego-centric camera data and onboard sensors, without inference-time global localization. It introduces an asymmetric actor-critic architecture trained with QR-SAC, where the actor uses local vision and proprioception while the critic leverages global features during training to improve policy quality. The agent achieves champion-level performance against GT7's built-in AI across three tracks, outperforming human champions in several scenarios and surpassing GT Sophy in others, with ablations confirming the importance of memory and global-information utilization during training. The results highlight the potential of vision-based reinforcement learning for high-speed, multi-agent racing and pave the way for practical deployment with reduced dependence on external localization and instrumentation.

Abstract

Deep reinforcement learning has achieved superhuman racing performance in high-fidelity simulators such as Gran Turismo 7 (GT7). However, it typically relies on global features that require instrumentation external to the car, such as precise localization of the agent and its opponents, which limits real-world applicability. To address this limitation, we introduce a vision-based autonomous racing agent that relies solely on ego-centric camera views and onboard sensor data, eliminating the need for precise localization during inference. The agent employs an asymmetric actor-critic framework: the actor uses a recurrent neural network over sensor data local to the car to retain track layouts and opponent positions, while the critic accesses the global features during training. Evaluated in GT7, our agent consistently outperforms GT7's built-in AI drivers. To our knowledge, this work presents the first vision-based autonomous racing agent to demonstrate champion-level performance in competitive racing scenarios.
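The information split in the asymmetric actor-critic framework can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, feature sizes, two-dimensional action, and the linear quantile critic are all assumptions chosen for illustration, and no training step is shown. It shows only that the recurrent actor consumes local inputs plus its own memory, while the critic consumes global features that are available during training.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LocalObs:
    """Inputs available to the actor at inference time (hypothetical fields)."""
    image_features: List[float]  # stand-in for an encoded ego-centric camera frame
    proprioception: List[float]  # stand-in for onboard sensor readings


@dataclass
class GlobalObs:
    """Training-only inputs for the critic (hypothetical field)."""
    features: List[float]  # stand-in for precise localization of car and opponents


def init(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Small random weight matrix with the given shape."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights: List[List[float]], x: List[float]) -> List[float]:
    """Matrix-vector product without a bias term."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class RecurrentActor:
    """Toy recurrent policy: sees only local observations and its hidden state."""

    def __init__(self, local_dim: int, hidden_dim: int, action_dim: int,
                 rng: random.Random) -> None:
        self.hidden_dim = hidden_dim
        self.w_in = init(hidden_dim, local_dim + hidden_dim, rng)
        self.w_out = init(action_dim, hidden_dim, rng)

    def initial_state(self) -> List[float]:
        return [0.0] * self.hidden_dim

    def step(self, obs: LocalObs,
             hidden: List[float]) -> Tuple[List[float], List[float]]:
        x = obs.image_features + obs.proprioception + hidden
        new_hidden = [math.tanh(v) for v in linear(self.w_in, x)]
        action = [math.tanh(v) for v in linear(self.w_out, new_hidden)]
        return action, new_hidden


class QuantileCritic:
    """Toy critic: sees global features and the action, returns return quantiles."""

    def __init__(self, global_dim: int, action_dim: int, n_quantiles: int,
                 rng: random.Random) -> None:
        self.w = init(n_quantiles, global_dim + action_dim, rng)

    def quantiles(self, obs: GlobalObs, action: List[float]) -> List[float]:
        return linear(self.w, obs.features + action)


rng = random.Random(0)
actor = RecurrentActor(local_dim=6, hidden_dim=8, action_dim=2, rng=rng)
critic = QuantileCritic(global_dim=5, action_dim=2, n_quantiles=4, rng=rng)

# Roll the actor over a few steps; the hidden state carries memory across frames.
hidden = actor.initial_state()
for _ in range(3):
    local = LocalObs([rng.random() for _ in range(4)],
                     [rng.random() for _ in range(2)])
    action, hidden = actor.step(local, hidden)

# During training only, the critic scores the action using global features.
q_values = critic.quantiles(GlobalObs([rng.random() for _ in range(5)]), action)
```

At deployment only `RecurrentActor` would be needed, which is why the global features never have to be measured on the car.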

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Top: Our agent controlling an Audi TT Cup and racing against GT7's built-in AI (BIAI). Bottom: Histogram of the winning margin, the distance between our agent and the leading BIAI at race completion. This evaluation involves starting from the last position versus 19 identical BIAI agents on the Tokyo Expressway track. Our agent consistently outperforms both Human Expert and Human Champion.
  • Figure 2: Architecture Overview. The actor processes the image and proprioceptive features to predict actions, using a recurrent memory to track opponents and track layouts. The critic evaluates these actions using the global features. Both networks are jointly trained with the QR-SAC algorithm.
  • Figure 3: Racing Scenarios. Visualization of track, car, and sample input image for each training scenario.
  • Figure 4: Performance comparison of our agent, GT Sophy, a Human Expert, and a Human Champion. Car collision time is the total duration of contact with any opponent, while winning margin is the distance between the agent and the highest-ranked opponent when the agent completes all four laps. The contours represent the density of data points, with denser regions indicating more frequent performance outcomes. Upper-right regions indicate superior performance, as they represent larger winning margins achieved with lower car collision times.
  • Figure 5: Visualizing our agent's trajectory and action attributions in the Spa scenario. The sequence is shown in 0.5-second intervals and consists of three rows. Top: the trajectory of our agent (red) and a BIAI opponent (black). Middle: attribution maps computed with Integrated Gradients, highlighting the agent's focus on lower vehicle regions for overtaking opportunities or treelines for track layout. Bottom: how visual features from past frames contribute to the action predicted for the final frame, demonstrating the agent's ability to infer information that is not included in the final frame.
  • ...and 1 more figure
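The two evaluation metrics defined in the Figure 4 caption can be computed from a per-step race log, as in the sketch below. The log format is hypothetical: the `Step` fields, the fixed step duration `dt_s`, and the use of a signed course-progress difference (positive when the agent is ahead) are assumptions for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One logged time step of a race (hypothetical format)."""
    agent_progress_m: float          # agent's distance along the course, in metres
    best_opponent_progress_m: float  # same quantity for the highest-ranked opponent
    in_contact: bool                 # True if the agent touches any opponent


def car_collision_time(steps: List[Step], dt_s: float) -> float:
    """Total duration, in seconds, of contact with any opponent."""
    return sum(dt_s for s in steps if s.in_contact)


def winning_margin(steps: List[Step]) -> float:
    """Signed distance to the highest-ranked opponent at the agent's finish.

    Assumes the last logged step is the moment the agent completes the race.
    """
    final = steps[-1]
    return final.agent_progress_m - final.best_opponent_progress_m


race_log = [
    Step(0.0, 40.0, False),
    Step(500.0, 510.0, True),
    Step(520.0, 515.0, True),
    Step(800.0, 770.0, True),
    Step(1000.0, 950.0, False),
]

collision_s = car_collision_time(race_log, dt_s=0.1)
margin_m = winning_margin(race_log)
```

Under this convention, a point in the upper-right of Figure 4 corresponds to a large `margin_m` together with a small `collision_s`.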