CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving

Yihong Guo; Dongqiangzi Ye; Sijia Chen; Anqi Liu; Xianming Liu

Abstract

Autonomous driving requires safe planning, but most learning-based planners lack explicit self-correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self-correction trace, generate the next motion token conditioned on it, and repeat this process until a safe motion token is proposed or the safety criterion is met. This self-correction trace, consisting of all unsafe motion tokens, represents the planner's correction process in motion-token space, analogous to a reasoning trace in language models. We train the planner with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model that realistically models agents' reactive behaviors. Closed-loop evaluations show that CorrectionPlanner reduces collision rate by over 20% on Waymax and achieves state-of-the-art planning scores on nuPlan.
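The propose, evaluate, and correct loop described above can be sketched for a single planning step as follows. This is a minimal illustration rather than the authors' implementation: `plan_step`, `propose`, `collision_prob`, `threshold`, and `max_corrections` are hypothetical names, motion tokens are stood in for by integers, and the stopping rule (a safe token or a cap on correction steps) is an assumption based on the Figure 3 caption.

```python
from typing import Callable, List, Sequence, Tuple

Token = int  # assumption: a motion token is represented here as an integer index


def plan_step(
    propose: Callable[[Sequence[Token], Sequence[Token]], Token],
    collision_prob: Callable[[Sequence[Token], Token], float],
    history: Sequence[Token],
    threshold: float = 0.5,
    max_corrections: int = 4,
) -> Tuple[Token, List[Token]]:
    """Run one propose/evaluate/correct step (hypothetical interface).

    propose(history, trace) returns a candidate token conditioned on the
    scene history and on the rejected tokens collected so far.
    collision_prob(history, token) plays the role of the learned critic.
    """
    trace: List[Token] = []  # self-correction trace of rejected tokens
    token = propose(history, trace)
    # Keep correcting while the critic flags the proposal as unsafe and the
    # correction budget is not exhausted.
    while collision_prob(history, token) >= threshold and len(trace) < max_corrections:
        trace.append(token)
        token = propose(history, trace)
    # The returned token is the one that would be executed; it may still be
    # flagged unsafe if the budget ran out first.
    return token, trace


# Toy stand-ins for the policy and the critic, for illustration only.
def toy_propose(history: Sequence[Token], trace: Sequence[Token]) -> Token:
    return max(0, 3 - len(trace))  # proposes a "slower" token after each rejection


def toy_critic(history: Sequence[Token], token: Token) -> float:
    return 0.9 if token >= 2 else 0.1  # treats "fast" tokens as likely collisions


executed, trace = plan_step(toy_propose, toy_critic, history=[1, 1])
print(executed, trace)
```

In this toy setup the first two proposals are rejected and kept in the trace, and the third is executed. Lowering `max_corrections` trades safety for a shorter correction process, which is the kind of trade-off the paper's ablation on classification threshold and correction length examines.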

Paper Structure (28 sections, 7 equations, 11 figures, 10 tables)

Introduction
Related Work
Preliminaries
Self-Correction Autonomous Driving as Model-Based Reinforcement Learning
Learned World Model for Multi-Agent Dynamics
Method
Policy Network
Self-Correction with Reinforcement Learning
Training the Collision Critic
Experiment
Experimental Setup
Main Results
Ablation studies
Comparison with different baselines
Results on classification threshold and correction length
...and 13 more sections

Figures (11)

Figure 1: (a) Vanilla autoregressive models with no correction mechanism. (b) Autoregressive models that generate tokens sequentially, with reasoning through language. (c) Autoregressive models that generate subsequent tokens sequentially, with self-correction through motion tokens.
Figure 2: The blue box is the ego vehicle, the red cross marks a collision, and the orange trajectory is the expert trajectory. Left: autoregressive planning without self-correction. Right: autoregressive planning with self-correction. With self-correction, the ego avoids the collision by slowing down to yield to the right-turning vehicle and then re-accelerating.
Figure 3: (a) Policy architecture. A frozen reactive world model predicts future motions for the ego and other agents. Conditioned on scene history and map/navigation context, the planner is an autoregressive policy that generates the next ego motion token. When self-correction is triggered, previously rejected proposals are retained as a correction trace and integrated via a self-attention encoder to condition subsequent proposals, providing an explicit correction signal during both imitation learning and RL. (b) Self-correction. At each planning timestep, the policy proposes an ego motion token, and a learned collision critic evaluates whether it will cause a collision within a short horizon. If unsafe, the policy iteratively generates revised motion tokens conditioned on the correction trace until a safe token is found (or a maximum number of correction steps is reached). The resulting rollout contains both executed ego tokens and intermediate correction tokens, and the policy is further optimized with REINFORCE using a rule-based reward and KL regularization.
Figure 4: Visualization of self-correction. (a), (c), and (e): our method without self-correction at inference. (b), (d), and (f): our method with self-correction. The orange trajectory is the logged trajectory. With self-correction, our method avoids the collision in these scenarios.
Figure 5: Effect of collision classification threshold and correction length on safety and progression.
...and 6 more figures
