Evaluating and Enhancing Trustworthiness of LLMs in Perception Tasks

Malsha Ashani Mahawatta Dona; Beatriz Cabrero-Daniel; Yinan Yu; Christian Berger

TL;DR

The paper investigates trustworthiness of multimodal large language models for perception tasks in ADAS/AD, with a focus on hallucination detection in pedestrian detection. It compares GPT-4V and LLaVA on Waymo and PREPER CITY datasets, evaluating detection strategies including BO3, THV, THV-2, and a physical plausibility check, using both full images and ROI-based localization with temporal data. Findings show GPT-4V generally outperforms LLaVA, while BO3 offers limited benefits for open LLMs; leveraging historical frames and plausibility checks improves detection, albeit with model-specific variability. The work provides a rigorous evaluation pipeline, data curation, and actionable guidance for designing robust, LLM-enabled perception stacks in vehicles.

Abstract

Today's advanced driver assistance systems (ADAS), such as adaptive cruise control or rear collision warning, are finding broader adoption across vehicle classes. Integrating advanced, multimodal Large Language Models (LLMs) on board a vehicle, capable of processing text, images, audio, and other data types, could greatly enhance passenger comfort. Yet hallucinations in LLMs remain a major challenge. In this paper, we systematically assess potential hallucination detection strategies for such LLMs in the context of object detection in vision-based data, using pedestrian detection and localization as an example. We evaluate three hallucination detection strategies applied to two state-of-the-art LLMs, the proprietary GPT-4V and the open LLaVA, on two datasets (Waymo/US and PREPER CITY/Sweden). Our results show that these LLMs can describe a traffic situation in impressive detail but still struggle with further analysis tasks such as object localization. We evaluate and extend hallucination detection approaches when applying these LLMs to video sequences, again using pedestrian detection as the example. Our experiments show that, at present, the state-of-the-art proprietary LLM performs much better than the open LLM. Furthermore, voting-based consistency enhancement techniques, such as the Best-of-Three (BO3) method, do not effectively reduce hallucinations in LLMs that tend to produce many false negatives when detecting pedestrians. However, extending hallucination detection with information from past frames helps to improve results.
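
The voting idea behind BO3 can be illustrated with a minimal sketch. It assumes the model is queried three times with the same yes/no prompt and that the majority answer is kept; the function name `best_of_three`, the answer normalization, and the "other" bucket for refusals are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Sequence


def best_of_three(answers: Sequence[str]) -> str:
    """Majority vote over three yes/no answers (hypothetical helper).

    Anything that is not a plain 'yes' or 'no' (for example a refusal)
    is counted as 'other'. If no label reaches two votes, 'other' is returned.
    """
    if len(answers) != 3:
        raise ValueError("BO3 expects exactly three answers")

    normalized = []
    for answer in answers:
        token = answer.strip().lower().rstrip(".")
        normalized.append(token if token in ("yes", "no") else "other")

    label, votes = Counter(normalized).most_common(1)[0]
    return label if votes >= 2 else "other"


if __name__ == "__main__":
    print(best_of_three(["Yes", "no", "yes."]))
    print(best_of_three(["no", "no", "no"]))
```

Under these assumptions, a model that answers "no" in all three queries still yields "no" after voting, which is consistent with the observation that voting does little for models dominated by false negatives.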

Paper Structure (16 sections, 4 figures, 3 tables)

This paper contains 16 sections, 4 figures, 3 tables.

Introduction
Problem Domain and Motivation
Research Goal and Research Questions
Contributions and Scope
Structure of the Paper
Related Work
Methodology
Dataset Curation and Preparation
Data Collection
Data Analysis
Results
Types of Hallucinations and LLM Performance on Unmodified Images (DA-1, DA-2)
BO3 Hallucination Detection across RoIs (DA-2)
Hallucination Detection using Historical Frames in an Automotive Context (DA-3 and DA-4)
Analysis and Discussion
...and 1 more section

Figures (4)

Figure 1: Overview diagram of the experimental setup: All frames in each sequence of images are systematically cropped to remove the horizon and split into four RoIs to support the localization of pedestrians. All RoIs and full images are evaluated with the LLMs GPT-4V and LLaVA using the prompt: "Is there a human or part of a human in this image? Answer ONLY either 'yes' or 'no'." The LLMs' responses are compared against the GT collected by human annotations to analyze the three experiments.
Figure 2: Types of hallucinations: (a) False Negatives: The LLM is unable to detect the pedestrian in the left corner, highlighted by the yellow box. (b) False Positives: The LLM hallucinates a pedestrian in the image, whereas GT denotes that there is no human or a part of a human present in the image. (c) Other: The LLM refuses to process the picture, hallucinating some content that is not allowed by the safety system, resulting in a content policy violation.
Figure 3: Proportion of yes/no labels among the RoI in focus and the same RoI in the previous two frames. The upper graph shows the percentage of 'yes' labels in RoIs that show humans according to the GT, whereas the lower graph shows the percentage of 'no' labels when that RoI does not contain a pedestrian. The navy bar, representing GT labels, shows that the labels for RoIs tend to remain the same across three consecutive frames.
Figure 4: Representation of RoIs labelled as containing a pedestrian (rectangles) in three consecutive frames: the current time $t$, $t-1$, and $t-2$. Checking if the RoIs in frame $t$ labelled as containing a human are adjacent to the RoIs in $t-1$ and $t-2$ (DA-4) can help us detect physically impossible motions, potentially due to hallucinations in the LLM labelling.
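
The adjacency check described for Figure 4 can be sketched as follows. This is a hedged illustration, not the paper's DA-4 procedure: it assumes the four RoIs form a single left-to-right row indexed 0 to 3, that each frame is a list of booleans ("yes" = True), and that a "yes" at time t is plausible when the same or a neighbouring RoI was labelled "yes" at t-1 or t-2. The function name `suspicious_rois` is hypothetical.

```python
from typing import List, Sequence


def suspicious_rois(
    frame_t: Sequence[bool],
    frame_t1: Sequence[bool],
    frame_t2: Sequence[bool],
) -> List[int]:
    """Return indices of RoIs labelled 'yes' at time t that have no 'yes'
    in the same or an adjacent RoI at t-1 or t-2 (assumed row layout)."""

    def has_support(idx: int, earlier: Sequence[bool]) -> bool:
        # Look at the RoI itself and its immediate left/right neighbours.
        lo, hi = max(0, idx - 1), min(len(earlier), idx + 2)
        return any(earlier[lo:hi])

    return [
        idx
        for idx, is_yes in enumerate(frame_t)
        if is_yes
        and not (has_support(idx, frame_t1) or has_support(idx, frame_t2))
    ]


if __name__ == "__main__":
    t = [True, False, False, True]
    t1 = [True, False, False, False]
    t2 = [False, True, False, False]
    print(suspicious_rois(t, t1, t2))
```

A flagged RoI is only a candidate hallucination: under these assumptions a pedestrian who genuinely enters the scene at time t would be flagged as well, so the output is better treated as a prompt for re-checking than as a verdict.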
