Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models

Yi Yang; Qingwen Zhang; Kei Ikemura; Nazre Batool; John Folkesson

TL;DR

Hard-case scenarios in motion prediction for autonomous driving are rare and safety-critical, requiring diverse data. The authors propose leveraging Vision-Language Foundation Models, specifically GPT-4v, to detect hard cases from sequences of camera frames via designed prompts, at both agent and scene levels. They validate this detection by comparing VLM-derived rankings against ground-truth rankings from state-of-the-art predictors on NuScenes, using ranking metrics. Furthermore, they show that VLM-driven data selection can improve training efficiency, enabling effective learning with substantially smaller subsets of data.

Abstract

Addressing hard cases in autonomous driving, such as anomalous road users, extreme weather conditions, and complex traffic interactions, presents significant challenges. To ensure safety, autonomous driving systems must detect and manage these scenarios effectively. However, the rarity and high-risk nature of these cases demand extensive, diverse datasets for training robust models. Vision-Language Foundation Models (VLMs) have shown remarkable zero-shot capabilities because they are trained on extensive datasets. This work explores the potential of VLMs in detecting hard cases in autonomous driving. We demonstrate the capability of VLMs such as GPT-4v in detecting hard cases in traffic participant motion prediction at both the agent and scenario levels. We introduce a feasible pipeline in which VLMs, fed sequential image frames together with designed prompts, effectively identify challenging agents or scenarios, which are then verified by existing prediction models. Moreover, by taking advantage of this detection of hard cases by VLMs, we further improve the training efficiency of the existing motion prediction pipeline by selecting the training samples suggested by GPT. We show the effectiveness and feasibility of our pipeline incorporating VLMs with state-of-the-art methods on the NuScenes dataset. The code is accessible at https://github.com/KTH-RPL/Detect_VLM.
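
The data-selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each scene already has a numeric difficulty score from the VLM and that the subset keeps the highest-scoring scenes. Both the selection rule and the names `select_hard_scenes`, `difficulty_by_scene`, and `fraction` are hypothetical and are not taken from the paper or its repository.

```python
def select_hard_scenes(difficulty_by_scene, fraction):
    """Return the IDs of the highest-scoring scenes, keeping `fraction` of them.

    Assumption: a higher score means a harder scene, and the training
    subset is formed by keeping the hardest scenes first. Ties are broken
    by scene ID so the result is deterministic.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, int(len(difficulty_by_scene) * fraction))
    ranked = sorted(difficulty_by_scene.items(), key=lambda kv: (-kv[1], kv[0]))
    return [scene_id for scene_id, _ in ranked[:n_keep]]


# Hypothetical scores for four scenes.
scores = {"scene_a": 9, "scene_b": 3, "scene_c": 7, "scene_d": 5}
subset = select_hard_scenes(scores, fraction=0.5)
```

The returned list would then be used to filter the training split before training the motion predictor. Other selection rules, such as sampling in proportion to the score, would fit the same interface.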

Paper Structure (18 sections, 4 figures, 3 tables)

This paper contains 18 sections, 4 figures, 3 tables.

Introduction
Related Work
Motion Prediction Using Camera Images
Challenging Cases Estimation in Motion Prediction
LLMs and VLMs for Autonomous Driving Systems
Methodology
Experiments
Datasets and Networks
Evaluation Metrics
Concordance Index (C-index)
Kendall's Tau
Top-K Accuracy
Normalized Discounted Cumulative Gain (NDCG)
Quantitative Result of Agents' Rankings
Ablation Study
...and 3 more sections

Figures (4)

Figure 1: Two stages of evaluation. Stage 1: Verify the ability of the VLM to detect hard cases, using existing motion prediction results as ground truth. We examine whether the VLM's ranking of the most difficult-to-predict agents matches the order given by the highest displacement error in existing motion prediction networks. Stage 2: Improve training efficiency by training the network with a smaller subset of data selected by the VLM.
Figure 2: Agent rankings by prediction error (difficulty), from highest to lowest. Using the UniAD ranking as ground truth, we compare it with a random order, the order from ViP3D, and the order from GPT-4v. The evaluation uses four metrics: C-index, NDCG, top-5 accuracy, and Kendall's Tau, where larger values indicate a higher correlation with the UniAD order. The x-axis is the metric value. For the random ordering, we conducted 10,000 trials; the distribution of the results is shown in the blue histogram, with the y-axis representing the probability density, and the metric value reported for random is the mean over all trials. The percentage values above the graph indicate the percentage of random trials surpassed by each value (cumulative probability).
Figure 3: Histogram of difficulty levels estimated by GPT-4v for 446 scenes.
Figure 4: GPT-4v scores the difficulty level of various scenarios. Here we show four scenes scored as {9, 7, 5, 3}, with the explanations output by GPT-4v. A higher score means more difficulty.
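
The ranking comparison described in the Figure 2 caption can be sketched as follows. This is an illustrative implementation using common textbook definitions of the four metrics, not the authors' evaluation code. The function names, the use of displacement errors as NDCG gains, and the reading of top-K accuracy as the overlap between the two top-K sets are all assumptions.

```python
import math
import random
from itertools import combinations


def kendall_tau(gt, pred):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    score = 0
    for i, j in combinations(range(len(gt)), 2):
        s = (gt[i] - gt[j]) * (pred[i] - pred[j])
        score += (s > 0) - (s < 0)
    return score / (len(gt) * (len(gt) - 1) / 2)


def c_index(gt, pred):
    """Fraction of pairs with distinct gt values that pred orders the same way.

    Pairs tied in pred count as half-correct.
    """
    correct, comparable = 0.0, 0
    for i, j in combinations(range(len(gt)), 2):
        if gt[i] == gt[j]:
            continue
        comparable += 1
        s = (gt[i] - gt[j]) * (pred[i] - pred[j])
        correct += 1.0 if s > 0 else 0.5 if s == 0 else 0.0
    return correct / comparable


def top_k_accuracy(gt, pred, k):
    """Assumed definition: overlap between the top-k sets of gt and pred."""
    top = lambda xs: set(sorted(range(len(xs)), key=lambda i: -xs[i])[:k])
    return len(top(gt) & top(pred)) / k


def ndcg(gt, pred):
    """NDCG with gt values (e.g. displacement errors) used as gains."""
    order = sorted(range(len(pred)), key=lambda i: -pred[i])
    dcg = sum(gt[i] / math.log2(r + 2) for r, i in enumerate(order))
    ideal = sum(g / math.log2(r + 2) for r, g in enumerate(sorted(gt, reverse=True)))
    return dcg / ideal


def fraction_of_random_beaten(metric, gt, pred, trials=10_000, seed=0):
    """Share of random orderings whose metric value is below that of pred."""
    rng = random.Random(seed)
    value = metric(gt, pred)
    shuffled = list(pred)
    beaten = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        beaten += metric(gt, shuffled) < value
    return beaten / trials
```

Here `gt` would hold per-agent errors from the reference predictor and `pred` the difficulty scores from the method under test, with a higher value meaning harder in both lists.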
