What-If Motion Prediction for Autonomous Driving

Siddhesh Khandelwal, William Qi, Jagjeet Singh, Andrew Hartnett, Deva Ramanan

TL;DR

The paper addresses long-horizon motion forecasting for autonomous driving by introducing WIMP, a recurrent graph-based model that jointly leverages interpretable road-network polylines and social actor interactions. It supports counterfactual forecasting conditioned on hypothetical polylines and social contexts, producing diverse multi-modal predictions with a likelihood-aware decoding strategy. Empirical results on Argoverse and nuScenes show state-of-the-art or competitive performance, with ablations confirming the value of combining map and social context under an EWTA-driven multi-predictor regime. The approach eases planner integration by providing controllable, interpretable forecasts and a mechanism to reason about unobserved or unlikely futures relevant to the AV's plan.

Abstract

Forecasting the long-term future motion of road actors is a core challenge to the deployment of safe autonomous vehicles (AVs). Viable solutions must account for both the static geometric context, such as road lanes, and dynamic social interactions arising from multiple actors. While recent deep architectures have achieved state-of-the-art performance on distance-based forecasting metrics, these approaches produce forecasts that are predicted without regard to the AV's intended motion plan. In contrast, we propose a recurrent graph-based attentional approach with interpretable geometric (actor-lane) and social (actor-actor) relationships that supports the injection of counterfactual geometric goals and social contexts. Our model can produce diverse predictions conditioned on hypothetical or "what-if" road lanes and multi-actor interactions. We show that such an approach could be used in the planning loop to reason about unobserved causes or unlikely futures that are directly relevant to the AV's intended route.

Paper Structure

This paper contains 18 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: While many feasible futures may exist for a given actor, only a small subset may be relevant to the AV's planner. In (a), neither of the dominant predicted modes (solid red) interacts with the AV's intended trajectory (solid grey). Instead, the planner only needs to consider an illegal left turn across traffic (dashed red). (b) depicts a partial set of lane segments within the scene; illegal maneuvers such as following segment $b$ can either be mapped or hallucinated. A centerline (centered polyline) associated with a lane segment is shown in segment $f$ (dashed black). The planner can utilize the directed lane graph (c) to identify lanes which may interact with its intended route. Black arrows denote directed edges, while thick grey undirected edges denote conflicting lanes. Such networks are readily available in open street map APIs \cite{haklay2008openstreetmap} and the recently-released Argoverse \cite{Argoverse} dataset.
  • Figure 2: Overview of the data flow within the WIMP encoder-decoder architecture (left) and polyline attention module (right). Input trajectories and reference polylines are first used to compute per-actor embeddings; social context is then incorporated via graph attention. Finally, a set of predictions is generated using a map-aware decoder that attends to relevant regions of the polyline via soft-attention.
  • Figure 3: Visualizing the map lane polyline attention weights generated during decoding. In the scenario depicted in (a), the focal actor's history is shown in yellow and its ground-truth future in red. The red circle highlights the true state 3s into the future. The solid green line denotes a predicted trajectory with a black chevron marking the $t=+3s$ state. The dashed green line shows the reference polyline. Grey cars/circles illustrate the current positions of on/off roadway actors. In (b, c, d), opacity corresponds to the magnitude of social attention. The subset of the polyline selected by the polyline attention module is shown in solid blue (points denoted as black circles), and the attention weights within that segment are shown via an ellipse (for predictions at $t=+0s, +1s, +2s$ respectively). Points outside the ellipse have negligible attention. WIMP learns to attend smoothly to upcoming points along the reference polyline.
  • Figure 4: Visualizations of two prediction scenarios that condition on (a) heuristically-selected polylines (see Appendix \ref{apdx:polyline} for details) and corresponding (b) counterfactual reference polylines. When making diverse predictions, WIMP learns to generate some trajectories independent of the conditioning polyline (see the straight-through predictions in (a)). Additionally, if the reference polyline is semantically or geometrically incompatible with the observed scene history (as in (2b), where the counterfactual polyline intersects other actors), the model learns to ignore the map input, relying only on social and historical context. Visualization style follows Fig. \ref{fig:vizattention}.
  • Figure 5: Visualizations of two scenarios that condition on (a) ground-truth scene context and (b) counterfactual social contexts (best viewed with magnification). Counterfactual actors are highlighted with a grey circle. In (1b), we inject a stopped vehicle just beyond the intersection, blocking the ground-truth right turn. Given the focal agent's history and velocity, this makes a right turn extremely unlikely, and that mode vanishes. In (2b), we replace the leading actor in (2a) with a stopped vehicle. As expected, this causes the model to predict trajectories containing aggressive deceleration. The final velocity ($v_f$) of a representative trajectory is 3.3 m/s in the counterfactual setting, compared with 10.3 m/s in the original scene. Visualization style follows Fig. \ref{fig:vizattention}.
  • ...and 1 more figure
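
The captions for Figures 2 and 3 describe a decoder that attends to relevant regions of the reference polyline via soft-attention. The general idea can be illustrated with a minimal sketch. It is not the paper's module: the function names `softmax` and `attend_polyline`, the 2D query, and the negative-squared-distance scoring are all assumptions chosen for illustration, whereas the actual model presumably scores learned embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_polyline(query, points):
    """Soft-attend over 2D polyline waypoints (illustrative only).

    query  -- hypothetical 2D decoder state, here simply a position estimate
    points -- list of (x, y) waypoints along the reference polyline
    Returns (weights, context), where context is the attention-weighted point.
    """
    # Assumed scoring rule: negative squared distance, so waypoints close to
    # the query receive higher attention. A learned model would score
    # embeddings instead.
    scores = [-((px - query[0]) ** 2 + (py - query[1]) ** 2) for px, py in points]
    weights = softmax(scores)
    cx = sum(w * px for w, (px, _) in zip(weights, points))
    cy = sum(w * py for w, (_, py) in zip(weights, points))
    return weights, (cx, cy)


# A straight polyline along the x-axis, queried near its third waypoint.
polyline = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
weights, context = attend_polyline((2.0, 0.0), polyline)
```

Because the weights form a probability distribution over waypoints, the context is a smooth blend of nearby points rather than a hard selection, which matches the smooth attention ellipses described for Figure 3.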