PlanScope: Learning to Plan Within Decision Scope for Urban Autonomous Driving

Ren Xin, Jie Cheng, Hongji Liu, Jun Ma

TL;DR

PlanScope tackles the challenge that expert driving logs mix short-term reactive maneuvers with long-term directional decisions, which can mislead imitation-learning planners. It introduces Wavelet Transform-based trajectory decomposition, Detail Decoding (MDD/IDD) to generate multi-level details, and Multi-Scope Supervision with time-dependent normalization and related weighting schemes. Experiments on the nuPlan benchmark show consistent improvements in safety and efficiency metrics, with timenorm achieving a CLS-NR score of $91.32\%$ without post-processing and $93.59\%$ in a hybrid setup. The results indicate that explicit modeling of decision scopes and multi-level detail supervision is a promising direction for robust urban autonomous driving planning.

Abstract

In the context of urban autonomous driving, imitation learning-based methods have shown remarkable effectiveness, typically by minimizing the discrepancy between expert driving logs and predictive decision sequences. However, expert driving logs natively contain future short-term decisions made in response to events such as sudden obstacles or rapidly changing traffic signals. We believe that unpredictable future events and the corresponding expert reactions can introduce reasoning disturbances, negatively affecting the convergence efficiency of planning models. At the same time, long-term decision information, such as maintaining a reference lane or avoiding stationary obstacles, is essential for guiding short-term decisions. Our preliminary experiments on shortening the planning horizon show a rise-and-fall trend in driving performance, supporting these hypotheses. Based on these insights, we present PlanScope, a sequential-decision-learning framework with novel techniques for separating short-term and long-term decisions in decision logs. To identify and extract each decision component, we propose applying the Wavelet Transform to trajectory profiles. Next, to enhance the detail-generating ability of neural networks, we propose additional Detail Decoders. Finally, to enable in-scope decision supervision across detail levels, Multi-Scope Supervision strategies are adopted during training. The proposed methods, especially the time-dependent normalization, outperform baseline models in closed-loop evaluations on the nuPlan dataset, offering a plug-and-play solution to enhance existing planning models.
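To make the idea of separating long-term and short-term components concrete, the following is a minimal sketch of a multi-level wavelet decomposition of a velocity profile. It is an illustration under stated assumptions, not the authors' implementation: it uses a discrete Haar transform written in NumPy as a stand-in, the function names `haar_decompose` and `haar_reconstruct` are hypothetical, and the velocity profile is synthetic. The wavelet family and the number of levels used in the paper are not stated in this summary.

```python
import numpy as np


def haar_decompose(signal, levels):
    """Multi-level Haar transform (hypothetical helper).

    Returns the coarse approximation and a list of detail arrays
    ordered from coarsest to finest.
    """
    approx = np.asarray(signal, dtype=float)
    if approx.size % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        # Differences of neighbours capture variation at this scale.
        details.append((even - odd) / np.sqrt(2.0))
        # Sums of neighbours give the smoother, longer-term trend.
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details[::-1]


def haar_reconstruct(approx, details):
    """Inverse of haar_decompose; details must be ordered coarse to fine."""
    signal = np.asarray(approx, dtype=float)
    for detail in details:
        out = np.empty(2 * signal.size)
        out[0::2] = (signal + detail) / np.sqrt(2.0)
        out[1::2] = (signal - detail) / np.sqrt(2.0)
        signal = out
    return signal


# Synthetic profile (assumption): slow speed-up plus a brief slow-down.
t = np.arange(16)
velocity = 5.0 + 0.25 * t
velocity[6:8] -= 2.0

approx, details = haar_decompose(velocity, levels=2)
restored = haar_reconstruct(approx, details)

# Dropping all details keeps only the coarse, long-term component.
trend = haar_reconstruct(approx, [np.zeros_like(d) for d in details])
```

With the details zeroed, `trend` is a piecewise-constant version of the profile (the mean over each block of four samples), while the detail arrays hold the finer-scale deviations, including the brief slow-down. Supervising a planner separately on such components is the kind of multi-level target that the abstract describes.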

Paper Structure

This paper contains 21 sections, 18 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: In the scenario presented, the ego Autonomous Vehicle (AV) (black rectangle) is driving on a structured road with dynamic obstacles (blue rectangle). Its long-term decision (orange arrow) is characterized by adherence to the reference line (black long dash) for driving. Conversely, the short-term decision (blue arrow) involves evading the approaching dynamic object through the application of lateral velocity. Long-term and short-term decisions constitute the general decision (purple arrow) at the moment. However, the demonstrated trajectory (green dashed arrow), which incorporates future short-term decisions (blue dashed arrow) to avoid unpredictable events (blue dashed area), is counterintuitive in the current context and introduces unexpected reasoning noise (red double-headed arrow).
  • Figure 2: Typical scenarios involving multi-scope decisions. (a) The long-term decision is a left turn. Between T = 2 s and T = 4 s, the vehicle stops to yield to pedestrians, then accelerates at T = 6 s to complete the turn. Between T = 6 s and T = 11 s, it stops again upon observing the green light for cross traffic. (b) The ego vehicle first accelerates, then decelerates to secure the right of way. After observing another vehicle stop at T = 4 s, it gradually resumes acceleration. The Continuous Wavelet Transform (CWT) of the expert velocity and acceleration profiles is shown below. In scenario (a), the region at scales larger than 20 indicates a general speed increase as the vehicle transitions from a branch road to a main street. At smaller scales, two deceleration segments correspond to the two reactive yielding maneuvers. In scenario (b), the acceleration decomposition reveals an accelerate-then-decelerate decision around T = 4 s, and a gradual speed-up around scale 25 between time steps 8-15. The colorbar for coefficient amplitude is omitted since it is unnecessary for qualitative analysis.
  • Figure 3: The overall model framework initiates with the individual embedding of driving context elements, including the Dynamic Agents (A), Static Objects (O), Autonomous Vehicle (AV), together with High-Definition Map, Traffic Signs, and Signals (M), followed by their concatenation with positional embedding (PE). This concatenated context is subsequently processed through a Multi-layer Transformer encoder, iterated $L_{enc}$ times. The scenario embedding serves as Key (K) and Value (V) of the transformer decoder and is decoded to formulate agent predictions, which serve to supervise the model's understanding of the driving context comprehensively. The reference lane embedding and anchor-free variables are combined to form a Query (Q) for the scenario embedding.
  • Figure 4: Multi-scope trajectory supervision is achieved in three forms: (a) time-weighted loss on the baseline; (b) decoding final embedding as both full and decomposed trajectories; (c) iteratively decoding multi-scope detail embeddings as corresponding trajectory details.
  • Figure 5: The iterative detail decoder uses reference lane encoding as the initial query for multi-layer attention. At each iteration, extracted detail coefficients are combined with the previous query to refine further detail extraction. The initial query provides an approximation, and both it and subsequent generated details are processed via an MLP. The resulting detail embeddings are then mapped and compared against corresponding components at each level, with horizon masks configurable per level.
  • ...and 1 more figure