Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

Yunhao Yang, Yuxin Hu, Mao Ye, Zaiwei Zhang, Zhichao Lu, Yi Xu, Ufuk Topcu, Ben Snyder

TL;DR

This work tackles the cost and latency challenges of applying multimodal foundation models to driving perception by introducing an uncertainty-guided selective refinement framework. It calibrates perception-model and foundation-model confidences into probabilistic guarantees via conformal prediction and uses a temporal inference module to tighten these bounds across frames. A triggering mechanism decides when to query a foundation model and when to replace perception-model predictions based on guaranteed reliability, achieving 10–15% accuracy gains with roughly 50% fewer foundation-model queries, and an extra ~5% improvement when temporal information is incorporated. The approach is validated on the nuScenes dataset with UniAD as the perception model and GPT-4o-mini as the foundation model, demonstrating robust performance across scenarios and weather conditions and providing principled uncertainty quantification for downstream decision-making.

Abstract

Multimodal foundation models offer promising advancements for enhancing driving perception systems, but their high computational and financial costs pose challenges. We develop a method that leverages foundation models to refine predictions from existing driving perception models -- such as enhancing object classification accuracy -- while minimizing the frequency of using these resource-intensive models. The method quantitatively characterizes uncertainties in the perception model's predictions and engages the foundation model only when these uncertainties exceed a pre-specified threshold. Specifically, it characterizes uncertainty by calibrating the perception model's confidence scores into theoretical lower bounds on the probability of correct predictions using conformal prediction. Then, it sends images to the foundation model and queries for refining the predictions only if the theoretical bound of the perception model's outcome is below the threshold. Additionally, we propose a temporal inference mechanism that enhances prediction accuracy by integrating historical predictions, leading to tighter theoretical bounds. The method demonstrates a 10 to 15 percent improvement in prediction accuracy and reduces the number of queries to the foundation model by 50 percent, based on quantitative evaluations from driving datasets.
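
The calibration step described above can be illustrated with a minimal sketch. It assumes a nonconformity score of one minus the confidence assigned to the ground-truth label, and takes the lower bound to be the empirical CDF of held-out calibration scores evaluated at the new confidence, with an `n + 1` denominator as a conservative finite-sample correction. The function names, the calibration data, and the threshold value are all hypothetical, and the paper's own calibration equation may differ.

```python
def nonconformity_scores(true_label_confidences):
    """Assumed score: 1 - confidence the model gave to the ground-truth label."""
    return [1.0 - c for c in true_label_confidences]


def calibrated_guarantee(confidence, cal_scores):
    """Fraction of calibration scores strictly below `confidence`.

    Dividing by n + 1 instead of n keeps the estimate conservative
    for a finite calibration set (an assumption of this sketch).
    """
    below = sum(1 for s in cal_scores if s < confidence)
    return below / (len(cal_scores) + 1)


# Hypothetical held-out calibration set: confidence assigned to the true label.
cal = nonconformity_scores([0.9, 0.8, 0.95, 0.4, 0.7, 0.85, 0.6, 0.99, 0.3])

threshold = 0.8  # user-specified threshold (illustrative value only)
for conf in (0.5, 0.9):
    g = calibrated_guarantee(conf, cal)
    # The foundation model would be queried only when the bound is below the threshold.
    print(f"confidence={conf:.2f} guarantee={g:.2f} query={g < threshold}")
```

In this toy setup the bound grows with the raw confidence, so only low-confidence predictions fall below the threshold and would be sent to the foundation model.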

Paper Structure (17 sections, 3 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Pipeline of the uncertainty-guided perception system enhancement: (1) The driving perception model takes image-based observations and returns predictions of the observed objects with confidence scores. The predictions include categories, attributes (move or stop), and tracking information (same object across multiple frames). (2) Calibrate each confidence score into a probabilistic guarantee $G_p$. (3) If a probabilistic guarantee $G_p$ is lower than a user-specified threshold $T$, trigger a foundation model to obtain another object prediction. Otherwise, use the perception model's prediction as the final prediction for downstream tasks. (4) Calibrate the foundation model's confidences into probabilistic guarantees $G_v$. (5) If $G_v > G_p$, refine the perception model's prediction with the foundation model's prediction.
  • Figure 2: A mechanism that obtains predictions and probabilistic guarantees from objects' temporal information (object tracking and category) across multiple time frames. This mechanism is embedded in Step (2) of Figure 1.
  • Figure 3: Nonconformity distributions for UniAD and GPT-4o-mini. From left to right, these distributions are used to estimate the probability density functions $f_c, f_a, f_t$ (described in the calibration section) and $f_v$ (described in the triggering section). Confidence calibration then uses either the calibration equation or the temporal-guarantee equation.
  • Figure 4: Perception failures without temporal inference. Due to occlusions, the left example is misclassified as "pedestrian" and the right example is misclassified as "moving."
  • Figure 5: The frequency of querying the foundation model versus the accuracy of category/attribute predictions. We can improve the attribute accuracy by 10 percent and category accuracy by 3 percent while halving the querying frequency.
  • ...and 3 more figures
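
The decision logic in Steps (3) and (5) of the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `Prediction`, `refine`, and the `query_foundation_model` callback are hypothetical names, and the guarantees $G_p$ and $G_v$ are assumed to have been calibrated already.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Prediction:
    label: str        # predicted category or attribute
    guarantee: float  # calibrated probabilistic guarantee (G_p or G_v)


def refine(
    perception: Prediction,
    threshold: float,
    query_foundation_model: Callable[[], Prediction],
) -> Tuple[Prediction, bool]:
    """Return (final prediction, whether the foundation model was queried)."""
    # Step (3): keep the perception output if its guarantee G_p reaches T.
    if perception.guarantee >= threshold:
        return perception, False
    # Otherwise trigger the (expensive) foundation model once.
    candidate = query_foundation_model()
    # Step (5): replace the prediction only if G_v > G_p.
    if candidate.guarantee > perception.guarantee:
        return candidate, True
    return perception, True


# Usage with a stand-in for the foundation-model call (hypothetical values).
def fake_foundation_model() -> Prediction:
    return Prediction(label="car", guarantee=0.9)


final, queried = refine(Prediction("truck", 0.6), 0.8, fake_foundation_model)
print(final.label, queried)
```

Passing the foundation model as a callback keeps the query lazy, so the cost is only paid for predictions whose guarantee falls below the threshold.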

Theorems & Definitions (2)

  • Definition 1
  • proof