LLMs Can Check Their Own Results to Mitigate Hallucinations in Traffic Understanding Tasks

Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger

TL;DR

This paper addresses hallucinations in multimodal LLM-based traffic understanding for automotive perception by adapting SelfCheckGPT to automotive data and evaluating GPT-4o, LLaVA, and Llama3 on the Waymo and PREPER CITY datasets. The methodology involves generating captions from multiple LLMs, decomposing captions into sentences, and using a self-consistency check to filter potential hallucinations, with analyses of dataset and time-of-day effects. Key findings show GPT-4o generally achieves higher baseline fidelity but incurs more false positives than LLaVA, while the SelfCheckGPT adaptation improves hallucination filtering across configurations, with daytime imagery yielding better detection performance. The work demonstrates the practicality of self-consistency-based hallucination mitigation in automotive perception tasks, highlights trade-offs between precision and recall, and outlines future work focused on vulnerable road users and robustness of prompts and data.

Abstract

Today's Large Language Models (LLMs) have showcased exemplary capabilities, ranging from simple text generation to advanced image processing. Such models are currently being explored for in-vehicle services such as supporting perception tasks in Advanced Driver Assistance Systems (ADAS) or Autonomous Driving (AD) systems, given the LLMs' capabilities to process multi-modal data. However, LLMs often generate nonsensical or unfaithful information, known as "hallucinations": a notable issue that needs to be mitigated. In this paper, we systematically explore the adoption of SelfCheckGPT to spot hallucinations by three state-of-the-art LLMs (GPT-4o, LLaVA, and Llama3) when analysing visual automotive data from two sources: the Waymo Open Dataset, from the US, and the PREPER CITY dataset, from Sweden. Our results show that GPT-4o is better at generating faithful image captions than LLaVA, although GPT-4o was more prone than LLaVA to mislabeling non-hallucinated content as hallucinations. Furthermore, the analysis of the performance metrics revealed that the dataset type (Waymo or PREPER CITY) did not significantly affect the quality of the captions or the effectiveness of hallucination detection. However, the models performed better on images captured during daytime than on those captured at dawn, dusk, or night. Overall, the results show that SelfCheckGPT and its adaptation can be used to filter hallucinations in generated traffic-related image captions for state-of-the-art LLMs.

Paper Structure

This paper contains 16 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: SelfCheckGPT with LLM prompting. Each LLM-generated sentence in a caption is compared against the remaining captions generated by the same LLM for the same prompt. Sentences that are supported by the other captions are considered non-hallucinated; the comparison itself is conducted by LLMs.
  • Figure 2: The experimental setup, depicting the adaptation of SelfCheckGPT. The LLM-generated sentences in a caption are compared with the remaining captions to identify hallucinated sentences. Based on this sentence-level consistency check, the sentences in the caption are filtered to create a refined version of the caption. Different checker and captioner LLMs are used in this setup.
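
The sentence-level consistency check described in the two figure captions can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `split_sentences`, `support_ratio`, `refine_caption`, and `toy_checker` are hypothetical, the 0.5 support threshold is an assumption, and the word-overlap `toy_checker` is only a stand-in for the LLM checker so that the example runs without any model access.

```python
import re
from typing import Callable, List

# A checker answers: does `context` (another sampled caption) support `sentence`?
# In the paper this role is played by an LLM; here it is any callable (assumption).
Checker = Callable[[str, str], bool]


def split_sentences(caption: str) -> List[str]:
    """Naive splitter on '.', '!' and '?' (an assumption, not the paper's method)."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def support_ratio(sentence: str, other_captions: List[str], checker: Checker) -> float:
    """Fraction of the other captions that the checker says support the sentence."""
    if not other_captions:
        return 0.0
    votes = sum(1 for context in other_captions if checker(sentence, context))
    return votes / len(other_captions)


def refine_caption(captions: List[str], index: int, checker: Checker,
                   threshold: float = 0.5) -> str:
    """Keep only the sentences of captions[index] supported by the remaining captions.

    The threshold value is a hypothetical choice made for this sketch.
    """
    others = captions[:index] + captions[index + 1:]
    kept = [
        sentence
        for sentence in split_sentences(captions[index])
        if support_ratio(sentence, others, checker) >= threshold
    ]
    return " ".join(kept)


def toy_checker(sentence: str, context: str) -> bool:
    """Stand-in for the LLM checker: every word longer than three letters
    in the sentence must also occur in the context caption."""
    words = {w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) > 3}
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    return bool(words) and words <= context_words


if __name__ == "__main__":
    sampled = [
        "A car waits at the crossing. A cyclist rides on the sidewalk.",
        "A car waits at the crossing.",
        "The car waits near a crossing.",
    ]
    # The cyclist sentence is unsupported by the other samples and is filtered out.
    print(refine_caption(sampled, 0, toy_checker))
```

Passing the checker in as a callable mirrors the setup in Figure 2, where the checker and captioner roles can be filled by different LLMs; swapping `toy_checker` for a function that prompts a model would leave the filtering logic unchanged.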