ParkDiffusion++: Ego Intention Conditioned Joint Multi-Agent Trajectory Prediction for Automated Parking using Diffusion Models

Jiarong Wei, Anna Rehr, Christian Feist, Abhinav Valada

TL;DR

ParkDiffusion++ jointly learns a multi-modal ego intention predictor and an ego-conditioned multi-agent joint trajectory predictor for automated parking. It also introduces counterfactual knowledge distillation, in which an EMA teacher refined by a frozen safety-guided denoiser provides pseudo-targets that capture how agents react to alternative ego intentions.

Abstract

Automated parking is a challenging operational domain for advanced driver assistance systems, requiring robust scene understanding and interaction reasoning. The key challenge is twofold: (i) predicting multiple plausible ego intentions according to context and (ii) for each intention, predicting the joint responses of surrounding agents, enabling effective what-if decision-making. However, existing methods often fall short, typically treating these interdependent problems in isolation. We propose ParkDiffusion++, which jointly learns a multi-modal ego intention predictor and an ego-conditioned multi-agent joint trajectory predictor for automated parking. Our approach makes four key contributions. First, we introduce an ego intention tokenizer that predicts a small set of discrete endpoint intentions from agent histories and vectorized map polylines. Second, we perform ego-intention-conditioned joint prediction, yielding socially consistent predictions of the surrounding agents for each possible ego intention. Third, we employ a lightweight safety-guided denoiser with different constraints to refine joint scenes during training, thus improving accuracy and safety. Fourth, we propose counterfactual knowledge distillation, where an exponential moving average (EMA) teacher refined by a frozen safety-guided denoiser provides pseudo-targets that capture how agents react to alternative ego intentions. Extensive evaluations demonstrate that ParkDiffusion++ achieves state-of-the-art performance on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Importantly, qualitative what-if visualizations show that other agents react appropriately to different ego intentions.
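
The abstract refers to an EMA teacher without spelling out the update. As a rough illustration only, the generic exponential-moving-average rule can be sketched as below. The function name `ema_update`, the dictionary-of-arrays parameter layout, and the decay value are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Generic EMA rule: new_teacher = decay * teacher + (1 - decay) * student.

    `teacher` and `student` are dicts mapping parameter names to arrays.
    The decay value is an illustrative assumption, not a value from the paper.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy parameters: the teacher drifts slowly toward the student.
student_params = {"w": np.array([1.0, 2.0])}
teacher_params = {"w": np.array([0.0, 0.0])}
teacher_params = ema_update(teacher_params, student_params, decay=0.9)
```

Because the teacher is a slowly moving average of the student, its outputs tend to change more smoothly across training steps, which is the usual motivation for using it as a source of pseudo-targets.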

Paper Structure (31 sections, 9 equations, 3 figures, 5 tables)


Figures (3)

  • Figure 1: Parking lot environments, characterized by their unstructured nature relative to on-road driving, require complex reasoning for autonomous navigation. The ego vehicle must perform what-if prediction, a process of evaluating multiple feasible ego future intentions while conditionally predicting the reactive behaviors of other agents to ensure safe and efficient decision making.
  • Figure 2: Overview of ParkDiffusion++. Given agent histories and a vectorized map with parking slots and obstacles, Stage 1 predicts a bank of ego endpoint tokens, with the ground truth token in red and others in green. During Stage 2 training, we denote $\mathrm{Decoder}_\theta$ as the student conditional joint predictor, and $\mathrm{Decoder}_{\bar{\theta}}$ as the EMA-updated teacher of $\mathrm{Decoder}_\theta$. In the supervised learning branch, the ground truth token drives $\mathrm{Decoder}_\theta$ to yield a raw joint scene. We compute $\mathcal{L}_{\text{GT}}$ on the raw output and use the frozen Denoiser to form a stop-gradient consistency target (not shown for simplicity). The Counterfactual Knowledge Distillation (CKD) module consists of a teacher branch and a student branch. The teacher branch runs $\mathrm{Decoder}_{\bar{\theta}}$ and refines its output with the same frozen Denoiser used in the supervised learning branch to form a teacher target. The student branch runs $\mathrm{Decoder}_\theta$ and learns to match that teacher target, subject to an additional safety penalty. The primary training objective of Stage 2 comprises the supervised learning loss and the counterfactual knowledge distillation loss ($\mathcal{L}_{\text{KD}}$). At inference (bottom row), a token from Stage 1 is selected and fed to $\mathrm{Decoder}_\theta$ to yield the final joint trajectory. Internal steps such as per-agent marginals, beam-based assembly of joint scenes, and scene selection are abstracted within this block for clarity.
  • Figure 3: Visualizations of ParkDiffusion++ predictions on the DLP (top row) and inD (bottom row) datasets. For both rows, the leftmost figure shows how our model reacts to the most likely intention (red star) from Stage 1. The other three figures in each row show the joint prediction result conditioned on other representative predicted intentions (green star). The ego vehicle is depicted in solid black at the center, while other agents are shown in gray. Trajectories are color-coded as follows: past in blue, ground-truth future in red, and predicted in green.
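
The student branch described in the Figure 2 caption (match a fixed teacher target, plus a safety penalty) can be illustrated with a toy loss. This is a minimal sketch under stated assumptions, not the paper's formulation: the mean-squared-error matching term, the hinge penalty on the minimum inter-agent distance, the 2.0 margin, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def min_pairwise_distance(scene):
    """Smallest distance between any two distinct agents over all timesteps.

    scene has shape (agents, timesteps, 2) holding x/y positions.
    """
    diff = scene[:, None, :, :] - scene[None, :, :, :]   # (A, A, T, 2)
    dist = np.linalg.norm(diff, axis=-1)                 # (A, A, T)
    off_diagonal = ~np.eye(scene.shape[0], dtype=bool)   # exclude self-pairs
    return dist[off_diagonal].min()

def safety_penalty(scene, margin=2.0):
    """Assumed hinge form: positive only if two agents come closer than margin."""
    return max(0.0, margin - min_pairwise_distance(scene))

def ckd_loss(student_scene, teacher_target, safety_weight=1.0, margin=2.0):
    """Toy distillation loss: match the teacher target, plus a safety penalty.

    teacher_target is treated as a constant; in an autodiff framework it
    would be wrapped in a stop-gradient.
    """
    match = np.mean((student_scene - teacher_target) ** 2)
    return match + safety_weight * safety_penalty(student_scene, margin)

# Two agents driving in parallel, 5 units apart, for three timesteps.
teacher_target = np.array([
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0]],
])
# A student prediction that places the second agent only 1 unit away.
unsafe_student = teacher_target.copy()
unsafe_student[1, :, 1] = 1.0

loss_matched = ckd_loss(teacher_target, teacher_target)
loss_unsafe = ckd_loss(unsafe_student, teacher_target)
```

In this toy setting, a student that reproduces the teacher target exactly incurs zero loss, while the unsafe prediction is penalized both for deviating from the target and for violating the assumed distance margin.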