Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline

Hui Zhou, Shaoshuai Shi, Hongsheng Li

TL;DR

This work tackles the copycat problem in imitation-based planners for autonomous driving by introducing a closed-loop simulator and a causality benchmark derived from the Waymo Open Dataset. It proposes a joint imitation-learning and reinforcement-learning framework, instantiated as MTR-SAC, which fuses IL features with RL to improve safety and adaptability beyond pure imitation. The causal benchmark, together with DFS-generated alternative endpoints, enables robust evaluation of generalization under identical inputs, reducing ego-state biases. The results show that the IL-RL baseline yields stronger performance in closed-loop scenarios and under diverse goals, highlighting its practical potential for safer, more flexible autonomous driving policies.

Abstract

Machine learning (ML)-based planners have recently gained significant attention. They offer advantages over traditional optimization-based planning algorithms. These advantages include fewer manually selected parameters and faster development. Within ML-based planning, imitation learning (IL) is a common algorithm. It primarily learns driving policies directly from supervised trajectory data. While IL has demonstrated strong performance on many open-loop benchmarks, it remains challenging to determine if the learned policy truly understands fundamental driving principles, rather than simply extrapolating from the ego-vehicle's initial state. Several studies have identified this limitation and proposed algorithms to address it. However, these methods often use original datasets for evaluation. In these datasets, future trajectories are heavily dependent on initial conditions. Furthermore, IL often overfits to the most common scenarios. It struggles to generalize to rare or unseen situations. To address these challenges, this work proposes: 1) a novel closed-loop simulator supporting both imitation and reinforcement learning, 2) a causal benchmark derived from the Waymo Open Dataset to rigorously assess the impact of the copycat problem, and 3) a novel framework integrating imitation learning and reinforcement learning to overcome the limitations of purely imitative approaches. The code for this work will be released soon.

Paper Structure

This paper contains 25 sections, 2 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of the copycat problem. (a)-(b) At a certain intersection, the most frequently observed behavior is going straight. (c) Policies trained using imitation learning tend to perform well in scenarios that are close to the training data. However, when typical goals or behaviors are unavailable, the learned policy may generate unreasonable trajectories.
  • Figure 2: The framework is divided into three parts. The first part is the closed-loop simulator, which initializes the environment, executes actions, and outputs the resulting state and reward. The second part handles imitation learning (IL) with a pre-trained MTR [shi2022motion]: it takes the current state from the simulator as input and produces transformer-encoded features along with future trajectories. The third part handles reinforcement learning, employing the Soft Actor-Critic (SAC) algorithm for offline learning.
  • Figure 3: One typical example in our causality benchmark illustrates going straight, turning right, and turning left. The red point is the goal.
  • Figure 4: Success and failure cases in four scenarios: U-Turn, Left-Turn, Right-Turn, and Go-Straight.
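
The TL;DR mentions DFS-generated alternative endpoints, and Figure 3 shows one scene with go-straight, right-turn, and left-turn goals. The following is a minimal sketch of how such endpoints might be enumerated by depth-first search. It assumes a toy lane graph stored as a successor dictionary. The lane IDs, the `max_depth` budget, and the function name `enumerate_endpoints` are hypothetical illustrations, not the paper's implementation or the Waymo Open Dataset schema.

```python
from typing import Dict, List


def enumerate_endpoints(
    successors: Dict[str, List[str]], start: str, max_depth: int
) -> List[List[str]]:
    """Depth-first search over lane successors (hypothetical helper).

    Returns every lane path from `start` that either reaches a lane with
    no unvisited successor or contains `max_depth` lanes.
    """
    paths: List[List[str]] = []

    def dfs(lane: str, path: List[str]) -> None:
        # Skip lanes already on the path so cyclic graphs terminate.
        next_lanes = [s for s in successors.get(lane, []) if s not in path]
        if not next_lanes or len(path) >= max_depth:
            paths.append(path)
            return
        for nxt in next_lanes:
            dfs(nxt, path + [nxt])

    dfs(start, [start])
    return paths


# Toy intersection (assumed layout): one approach lane, three maneuvers.
lane_graph = {
    "approach": ["straight", "left", "right"],
    "straight": ["exit_north"],
    "left": ["exit_west"],
    "right": ["exit_east"],
}

paths = enumerate_endpoints(lane_graph, "approach", max_depth=3)
endpoints = [p[-1] for p in paths]  # last lane of each path is a candidate goal
print(endpoints)
```

Each candidate goal shares the same starting lane, so a planner evaluated on all of them sees identical initial inputs and differs only in the goal it must reach, which is the property the causal benchmark relies on.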