Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model

Ziyan Wang, Peng Chen, Ding Li, Chiwei Li, Qichao Zhang, Zhongpu Xia, Guizhen Yu

Abstract

Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose a novel tokenized traffic simulation policy, R1Sim, which represents an initial attempt to explore reinforcement learning based on motion token entropy patterns, and systematically analyzes the impact of different motion tokens on simulation outcomes. Specifically, we introduce an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. We further optimize motion behaviors using Group Relative Policy Optimization (GRPO), guided by a safety-aware reward design. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation, resulting in realistic, safe, and diverse multi-agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance compared to state-of-the-art methods.
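The entropy-guided adaptive sampling mechanism described above allocates a larger sampling budget to high-entropy motion tokens. As a rough illustration only (not the paper's implementation), the sketch below maps normalized token entropy linearly onto a top-k budget between the bounds $k_{min}$ and $k_{max}$ referenced in Figure 5; the linear mapping and the helper names are assumptions:

```python
import math
import random

def token_entropy(probs):
    """Shannon entropy (in nats) of a categorical motion-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_top_k(probs, k_min=2, k_max=32):
    """Illustrative entropy-to-budget mapping (linear form is an assumption):
    near-deterministic tokens get only k_min candidates, while flat
    (high-uncertainty) distributions get up to k_max, concentrating
    exploration on high-entropy motion tokens."""
    h = token_entropy(probs)
    h_max = math.log(len(probs))  # entropy of the uniform distribution
    ratio = h / h_max if h_max > 0 else 0.0
    return max(k_min, min(k_min + round(ratio * (k_max - k_min)), k_max))

def sample_motion_token(probs, rng=random):
    """Sample one motion token from the entropy-scaled top-k candidates."""
    k = adaptive_top_k(probs)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]
```

Under this toy mapping, a sharply peaked distribution is sampled almost greedily, whereas a flat one receives the full exploratory budget, matching the exploration behavior sketched in Figure 1(a).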

Paper Structure

This paper contains 18 sections, 12 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The motivation of R1Sim. (a) Exploration: Compared with Top-K sampling [wu2024smart], our entropy-guided adaptive sampling explores more high-entropy motion tokens. (b) Exploitation: Compared with SFT [zhang2025closed], our refined GRPO estimates group-wise advantages and selects the optimal scenario.
  • Figure 2: Temporal evolution of token entropy and its role in characterizing scene uncertainty. (a) visualizes representative low- and high-entropy motion patterns at different time steps, highlighting the interested vehicle in red. (b) shows the temporal evolution of the ranked token probability distribution and the corresponding motion token entropy, where higher entropy coincides with a flatter distribution.
  • Figure 3: An overview of the R1Sim framework. Our framework follows (A) an NTP-based autoregressive formulation for sequential motion token generation. Exploration is facilitated by (B) an entropy-guided adaptive sampling strategy that allocates the sampling budget according to token entropy, while exploitation is guided by (C) a token-level reward model operating in traffic simulation. The policy network is optimized using (D) GRPO, enabling learning through group relative advantage estimation.
  • Figure 4: Entropy distribution of generated scenarios.
  • Figure 5: Impact of the minimum bound $k_{min}$, maximum bound $k_{max}$ and sample ranges on RMM.
  • ...and 1 more figure
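The group-wise comparative estimation mentioned in the abstract and in Figure 3(D) is the core of GRPO: each rollout's reward is normalized against its own sampling group, with no learned value critic. The sketch below illustrates that normalization together with a toy safety-aware reward; the specific penalty terms and weights are assumptions, not the paper's reward design:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimation: standardize each rollout's reward
    against the mean and std of its sampling group, so advantages are
    relative comparisons within the group rather than absolute values."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def safety_aware_reward(collided, off_road, progress):
    """Toy safety-aware reward in the spirit of the paper's design
    (terms and weights are illustrative assumptions): penalize
    collisions and off-road events, reward forward progress."""
    return progress - 10.0 * float(collided) - 5.0 * float(off_road)
```

Because the advantages are standardized within each group, they sum to (approximately) zero, and the highest-reward rollout in a group always receives the largest positive advantage, which is what drives the selection of the optimal scenario in Figure 1(b).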