Automated Reward Design for Gran Turismo

Michel Ma, Takuma Seno, Kaushik Subramanian, Peter R. Wurman, Peter Stone, Craig Sherstan

TL;DR

This paper tackles reward design in reinforcement learning for a complex racing simulation by introducing an iterative, LLM-/VLM-assisted framework that converts textual goals into executable reward functions. It replaces costly fitness metrics with learned preferences from vision-language models and uses a trajectory alignment coefficient to prune misaligned rewards, enabling automated search over reward functions. Empirical results show agents competitive with GT Sophy, and the approach yields novel behaviors while remaining generalizable beyond Gran Turismo 7. The work highlights practical gains in automated reward design, while noting computational demands and the continued need for human-in-the-loop guidance for stable performance.

Abstract

When designing reinforcement learning (RL) agents, a designer communicates the desired agent behavior through the definition of reward functions: numerical feedback given to the agent as reward or punishment for its actions. However, mapping desired behaviors to reward functions can be a difficult process, especially in complex environments such as autonomous racing. In this paper, we demonstrate how current foundation models can effectively search over a space of reward functions to produce desirable RL agents for the Gran Turismo 7 racing game, given only text-based instructions. Through a combination of LLM-based reward generation, VLM preference-based evaluation, and human feedback, we demonstrate how our system can be used to produce racing agents competitive with GT Sophy, a champion-level RL racing agent, as well as generate novel behaviors, paving the way for practical automated reward design in real-world applications.

Paper Structure

This paper contains 34 sections, 2 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of an iterative LLM-based reward design pipeline. The environment and task description are fed into a black-box system to produce compliant agents, and the agent is fed back into the system for improvement, sometimes accompanied by some human feedback.
  • Figure 2: Expansion of the system block in Figure 1 for our framework. First, reward code is generated from scratch by an LLM. The rewards are then filtered through a trajectory alignment filter to avoid training misaligned rewards. The rewards are all trained separately by an RL algorithm. Finally, a VLM or LLM selects the best agent for further iterations.
  • Figure 3: Left: Average final placements ($\pm \text{std}$) of final policies over all ten seeds, top five seeds only, and the best seed respectively. Middle: Average final placements ($\pm \text{std}$) of intermediate policies over all reward functions trained for each iteration of every seed. Right: Distribution of various racing metrics evaluated with the final agents from the top five seeds of each method.
  • Figure 4: The VLM tends to select agents that are also favored by humans. Frequency of a VLM's preferences, using different forms of inputs, to select the best, second best, third best, fourth best, and fifth best agent in a set of five agents, as decided by human experts.
  • Figure 5: a) The trajectory alignment coefficient's accuracy at predicting a reward function's post-training performance for varying sizes of preference datasets. The dashed line represents the baseline accuracy of a random guesser. b) The average correlation ($\pm \text{std}$) of all generated reward functions with respect to rewards from GT Sophy (with human) and the best LLM-based agent (with self) before (original) and after (tuned) optimizing the reward component weights towards the reference policy. c) The relationship between the correlation of a reward function with respect to the rewards from GT Sophy, and its intermediate policy's performance.
  • ...and 4 more figures
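
The loop described in the captions of Figures 1 and 2 (generate rewards, filter them by alignment, train each survivor, select the best agent, repeat) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `alignment_score`, `design_loop`, `generate`, `train`, and `select` are hypothetical, the LLM, RL trainer, and VLM are stand-in callables, and the alignment filter is approximated by simple pairwise agreement between a reward's trajectory returns and a set of preferences, which may differ from the paper's trajectory alignment coefficient.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Trajectory = List[float]               # hypothetical: one scalar observation per step
RewardFn = Callable[[float], float]    # hypothetical: maps one step to a reward
Preference = Tuple[int, int]           # (index of preferred trajectory, index of the other)


def trajectory_return(reward: RewardFn, traj: Trajectory) -> float:
    """Sum of per-step rewards along one trajectory."""
    return sum(reward(step) for step in traj)


def alignment_score(reward: RewardFn,
                    trajs: Sequence[Trajectory],
                    prefs: Sequence[Preference]) -> float:
    """Pairwise agreement in [-1, 1] between preferences and returns.

    Assumption: a stand-in for the alignment filter, not the paper's formula.
    +1 means the reward ranks every preferred trajectory higher, -1 the reverse.
    """
    if not prefs:
        return 0.0
    total = 0
    for better, worse in prefs:
        diff = (trajectory_return(reward, trajs[better])
                - trajectory_return(reward, trajs[worse]))
        total += (diff > 0) - (diff < 0)  # sign of the difference
    return total / len(prefs)


def design_loop(generate: Callable[[Optional[object], int], List[RewardFn]],
                train: Callable[[RewardFn], object],
                select: Callable[[List[object]], object],
                trajs: Sequence[Trajectory],
                prefs: Sequence[Preference],
                iterations: int = 2,
                candidates: int = 3,
                threshold: float = 0.0) -> Optional[object]:
    """Iterate: generate rewards, drop misaligned ones, train, pick the best."""
    best_agent = None
    for _ in range(iterations):
        rewards = generate(best_agent, candidates)      # stand-in for the LLM
        aligned = [r for r in rewards
                   if alignment_score(r, trajs, prefs) > threshold]
        if not aligned:
            continue                                    # nothing worth training
        agents = [train(r) for r in aligned]            # stand-in for RL training
        best_agent = select(agents)                     # stand-in for the VLM/LLM judge
    return best_agent


# Toy usage: the second trajectory is preferred over the first.
trajs = [[1.0, 2.0], [3.0, 4.0]]
prefs = [(1, 0)]
candidate_rewards = [lambda s: s, lambda s: -s, lambda s: 2 * s]

best = design_loop(
    generate=lambda prev_best, n: candidate_rewards[:n],
    train=lambda r: r(1.0),   # toy "agent": the reward's value at a probe input
    select=max,
    trajs=trajs,
    prefs=prefs,
)
print(best)
```

In this toy run the negated reward is filtered out before any "training" happens, which is the point of the filter: compute is spent only on reward functions whose trajectory rankings agree with the available preferences.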