Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

Muleilan Pei, Shaoshuai Shi, Shaojie Shen

TL;DR

This work proposes SMART-R1, an R1-style reinforcement fine-tuning paradigm for next-token prediction models that better aligns agent behavior with human preferences and evaluation metrics. It also introduces a metric-oriented policy optimization algorithm to improve distribution alignment.

Abstract

Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.
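To make the idea of using an evaluation metric as a reward signal concrete, here is a minimal sketch of a REINFORCE-style surrogate loss for a single sampled rollout. It is not the authors' metric-oriented policy optimization algorithm, whose exact form is not given in this summary. The function name `rft_loss` and the `baseline` and `kl_weight` parameters are hypothetical choices for illustration. In an autodiff framework the advantage would be treated as a constant.

```python
def rft_loss(logp_policy, logp_ref, reward, baseline=0.0, kl_weight=0.0):
    """Surrogate loss for one sampled rollout (illustrative sketch only).

    logp_policy / logp_ref: per-token log-probabilities of the sampled motion
    tokens under the policy being tuned and under a frozen reference policy.
    reward: scalar score for the whole rollout, e.g. a realism metric.
    baseline, kl_weight: hypothetical hyperparameters, not taken from the paper.
    """
    # Rollouts scoring above the baseline get their log-likelihood pushed up,
    # rollouts scoring below it get pushed down.
    advantage = reward - baseline
    seq_logp = sum(logp_policy)
    # Single-sample estimate of KL(policy || reference) on the sampled tokens,
    # used to keep the tuned policy close to the reference.
    kl_est = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return -advantage * seq_logp + kl_weight * kl_est


if __name__ == "__main__":
    # A rollout scoring above the baseline yields a positive loss to minimize,
    # which raises the likelihood of its tokens.
    print(rft_loss([-1.0, -2.0], [-1.0, -2.0], reward=0.8, baseline=0.5))
```

With the reward below the baseline, the sign of the first term flips, so minimizing the loss lowers the likelihood of that rollout instead.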

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Training pipeline of SMART-R1. The solid square represents the agent state at the current timestep. The Open-Loop NTP predicts the next single-step state, while Closed-Loop SFT rolls out entire trajectories autoregressively to identify those closest to the ground truth. Both stages are optimized with per-token cross-entropy loss. In contrast, Closed-Loop RFT performs full rollouts but aligns trajectories with evaluation preferences through reward feedback and policy optimization.
  • Figure 2: Framework of SMART-R1. The model first tokenizes the driving scene context into agent motion tokens and map tokens. In the BC pretraining stage, the model is optimized under the standard NTP paradigm. During the SFT stage, CAT-K rollouts are used to refine the model in a closed-loop setting. Finally, in the RFT stage, the proposed MPO algorithm aligns the policy with target evaluation metrics, further improving simulation realism.
  • Figure 3: Pipeline of the Metric-oriented Policy Optimization (MPO).
  • Figure 4: Simulation results of SMART-R1. The top row shows the ground-truth (GT) scenario, while the two rows below illustrate two representative rollouts generated by our model for the same scene. The blue box denotes the ego vehicle, and white boxes denote other traffic participants, including cars and pedestrians. Transparent boxes indicate ground-truth agent positions, whereas solid boxes represent simulated agents. Green lanes indicate traversable paths under a green light, while red lanes denote restricted paths.
  • Figure 5: Simulation results of SMART-R1 in a U-turn scenario. The top row shows the ground-truth (GT) scenario. The blue box denotes the ego vehicle, and white boxes denote other traffic participants, including cars and pedestrians. Transparent boxes indicate ground-truth agent positions, whereas solid boxes represent simulated agents. Green lanes indicate traversable paths under a green light, while red lanes denote restricted paths. The visualization shows the simulated ego vehicle completing the U-turn consistently with the logged trajectory.
  • ...and 2 more figures
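The Figure 1 caption mentions two ingredients that a small example can illustrate: the per-token cross-entropy loss used in the NTP and SFT stages, and the selection of a rollout step closest to the ground truth. The sketch below assumes motion tokens are indexed by integers and that candidate next states are 2D positions. All function names are hypothetical, and the selection helper is only a simplified stand-in, not the CAT-K procedure itself.

```python
import math


def token_cross_entropy(logits, target):
    """Cross-entropy of one next-token prediction: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sequence_loss(logits_per_step, targets):
    """Mean per-token cross-entropy over a sequence of motion tokens."""
    losses = [token_cross_entropy(l, t) for l, t in zip(logits_per_step, targets)]
    return sum(losses) / len(losses)


def closest_candidate(candidates, gt_xy):
    """Index of the candidate (x, y) position nearest to the ground-truth position."""
    return min(range(len(candidates)), key=lambda i: math.dist(candidates[i], gt_xy))


if __name__ == "__main__":
    # Two timesteps, vocabulary of three tokens (toy values).
    logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
    print(sequence_loss(logits, targets=[0, 2]))
    # Pick the candidate next position nearest to the logged one.
    print(closest_candidate([(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)], (0.9, 1.2)))
```

In a closed-loop setting, the index returned by `closest_candidate` could serve as the cross-entropy target for that step, so the model is supervised on states it actually reaches during its own rollout.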