CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving

Yihong Guo, Dongqiangzi Ye, Sijia Chen, Anqi Liu, Xianming Liu

Abstract

Autonomous driving requires safe planning, but most learning-based planners lack explicit self-correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self-correction trace, generate the next motion token conditioned on it, and repeat this process until a safe motion token is proposed or the safety criterion is met. This self-correction trace, consisting of all unsafe motion tokens, represents the planner's correction process in motion-token space, analogous to a reasoning trace in language models. We train the planner with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model that realistically models agents' reactive behaviors. Closed-loop evaluations show that CorrectionPlanner reduces collision rate by over 20% on Waymax and achieves state-of-the-art planning scores on nuPlan.
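The propose, evaluate, and correct loop described in the abstract can be illustrated with a minimal sketch for a single planning step. This is not the authors' implementation: the function names, the integer token encoding, the `threshold` value, and the `max_corrections` cap are all hypothetical, and the toy policy and critic stand in for the learned autoregressive policy and collision critic.

```python
from typing import Callable, List, Sequence, Tuple


def plan_step_with_correction(
    propose: Callable[[Sequence[int]], int],
    collision_prob: Callable[[int], float],
    threshold: float = 0.5,      # hypothetical collision-classification threshold
    max_corrections: int = 4,    # hypothetical cap on correction steps
) -> Tuple[int, List[int]]:
    """Return (executed_token, correction_trace) for one planning step."""
    trace: List[int] = []        # rejected (unsafe) motion tokens so far
    token = propose(trace)       # initial proposal, empty trace
    # Re-propose while the critic flags a collision and budget remains.
    while collision_prob(token) >= threshold and len(trace) < max_corrections:
        trace.append(token)      # retain the unsafe proposal as context
        token = propose(trace)   # new proposal conditioned on the trace
    return token, trace


# Toy stand-ins (assumptions): tokens 0..4 are increasing speeds,
# and the critic treats tokens >= 3 as likely to collide.
def toy_policy(trace: Sequence[int]) -> int:
    return max(t for t in range(5) if t not in trace)


def toy_critic(token: int) -> float:
    return 0.9 if token >= 3 else 0.1


token, trace = plan_step_with_correction(toy_policy, toy_critic)
print(token, trace)  # 2 [4, 3]
```

In this toy run the two fastest tokens are rejected and kept in the trace, and the third proposal is executed. If the correction budget runs out first, the last proposal is returned even though the critic still flags it, which is one possible reading of the stopping rule and is an assumption here.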

Paper Structure

This paper contains 28 sections, 7 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: (a) Vanilla autoregressive models with no correction mechanism. (b) Autoregressive models that generate tokens sequentially, with reasoning through language. (c) Autoregressive models that generate subsequent tokens sequentially, with self-correction through motion tokens.
  • Figure 2: The blue box is the ego vehicle, the red cross marks a collision, and the orange trajectory is the expert trajectory. Left: autoregressive planning without self-correction. Right: autoregressive planning with self-correction, which avoids the collision by slowing down to yield to the right-turning vehicle and then re-accelerating.
  • Figure 3: (a) Policy architecture. A frozen reactive world model predicts future motions for the ego and other agents. Conditioned on scene history and map/navigation context, the planner is an autoregressive policy that generates the next ego motion token. When self-correction is triggered, previously rejected proposals are retained as a correction trace and integrated via a self-attention encoder to condition subsequent proposals, providing an explicit correction signal during both imitation learning and RL. (b) Self-correction. At each planning timestep, the policy proposes an ego motion token, and a learned collision critic evaluates whether it will cause a collision within a short horizon. If unsafe, the policy iteratively generates revised motion tokens conditioned on the correction trace until a safe token is found (or a maximum number of correction steps is reached). The resulting rollout contains both executed ego tokens and intermediate correction tokens, and the policy is further optimized with REINFORCE using a rule-based reward and KL regularization.
  • Figure 4: Visualization of self-correction. (a), (c), and (e): our method without self-correction at inference. (b), (d), and (f): our method with self-correction. The orange trajectory is the log trajectory. With self-correction, our method avoids the collision in these scenarios.
  • Figure 5: Effect of collision classification threshold and correction length on safety and progression.
  • ...and 6 more figures
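
The Figure 3 caption mentions optimizing the policy with REINFORCE using a rule-based reward and KL regularization. A rough sketch of such a scalar objective for one rollout follows. The KL direction (policy relative to a reference policy), the use of a fixed reference distribution, the `baseline`, and the `beta` weight are assumptions made for illustration, not details confirmed by the source.

```python
import math
from typing import List, Sequence


def kl_categorical(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def reinforce_kl_loss(
    policy_probs: List[Sequence[float]],  # per-step token distributions of the policy
    ref_probs: List[Sequence[float]],     # per-step distributions of a reference policy (assumed)
    actions: Sequence[int],               # sampled token index at each step
    reward: float,                        # scalar rollout reward, e.g. rule-based
    baseline: float = 0.0,                # hypothetical variance-reduction baseline
    beta: float = 0.1,                    # hypothetical KL weight
) -> float:
    """Surrogate loss: -(R - b) * sum_t log pi(a_t) + beta * sum_t KL(pi_t || ref_t)."""
    log_prob = sum(math.log(p[a]) for p, a in zip(policy_probs, actions))
    kl = sum(kl_categorical(p, q) for p, q in zip(policy_probs, ref_probs))
    return -(reward - baseline) * log_prob + beta * kl


# When policy and reference agree, the KL term vanishes.
same = reinforce_kl_loss([[0.5, 0.5]], [[0.5, 0.5]], actions=[0], reward=1.0)
# When they differ, the KL penalty increases the loss.
diff = reinforce_kl_loss([[0.5, 0.5]], [[0.9, 0.1]], actions=[0], reward=1.0)
```

In an actual training setup the same expression would be built from differentiable log-probabilities so that its gradient yields the REINFORCE update; the plain-Python version above only evaluates the scalar.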