Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios

Jiaqi Fan, Jianhua Wu, Hongqing Chu, Quanbo Ge, Bingzhao Gao

TL;DR

The paper tackles object hallucinations in vision-language models applied to traffic scenes. It introduces HCOENet, a training-free chain-of-thought correction framework comprising a hallucination cross-checking pathway and a critical-object enhancement pathway, designed to filter erroneous entities and supplement overlooked objects. The approach leverages a cascade of off-the-shelf models (BLIP-2, InstructBLIP, RAM++, Grounding-DINO-B, Llama-3.1-8B) to produce refined, semantically rich descriptions, and furthermore enables automatic generation of CODA_desc and nuScenes_desc datasets. On the POPE benchmark, HCOENet yields substantial F1-score improvements for several LVLMs and achieves performance comparable to GPT-4o at a markedly lower cost, demonstrating practical value for safer autonomous driving. The work also delivers two traffic-scene semantic datasets and provides detailed ablations and analysis to validate the effectiveness and efficiency of the framework across diverse models and settings.

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation tasks. However, these models occasionally generate hallucinatory text: descriptions that seem reasonable but do not correspond to the image. In an autonomous driving system, this phenomenon can lead to wrong driving decisions. To address this challenge, this paper proposes HCOENet, a plug-and-play chain-of-thought correction method designed to eliminate object hallucinations and generate enhanced descriptions for critical objects overlooked in the initial response. Specifically, HCOENet employs a cross-checking mechanism to filter entities and extracts critical objects directly from the given image, enriching the descriptive text. Experimental results on the POPE benchmark demonstrate that HCOENet improves the F1-score of the Mini-InternVL-4B and mPLUG-Owl3 models by 12.58% and 4.28%, respectively. Additionally, qualitative results on images collected in an open campus scene further highlight the practical applicability of the proposed method. Compared with the GPT-4o model, HCOENet achieves comparable descriptive performance while significantly reducing costs. Finally, two novel semantic understanding datasets, CODA_desc and nuScenes_desc, are created for traffic scenarios to support future research. The code and datasets are publicly available at https://github.com/fjq-tongji/HCOENet.
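The two pathways named in the abstract (cross-checking to filter entities, then supplementing overlooked critical objects) can be illustrated with a small sketch. This is not the authors' implementation: the agreement rule (`min_agree`), the function names, and the idea that each verifier returns a boolean per entity are all assumptions made for illustration. In practice, the booleans would come from models such as those listed in the TL;DR.

```python
def cross_check(entities, verifier_votes, min_agree=2):
    """Split entities into kept and dropped lists.

    Hypothetical rule: an entity is kept only if at least `min_agree`
    verifiers report it as present in the image.
    """
    kept, dropped = [], []
    for entity in entities:
        votes = verifier_votes.get(entity, [])
        agree = sum(1 for vote in votes if vote)
        if agree >= min_agree:
            kept.append(entity)
        else:
            dropped.append(entity)
    return kept, dropped


def find_missed_objects(kept, detected_objects, critical_classes):
    """Return detected critical objects that the kept description omits.

    `critical_classes` is a hypothetical whitelist of traffic-relevant
    categories; duplicates in `detected_objects` are reported once.
    """
    seen = set(kept)
    missed = []
    for obj in detected_objects:
        if obj in critical_classes and obj not in seen:
            seen.add(obj)
            missed.append(obj)
    return missed


# Toy inputs: entities parsed from an initial LVLM description and
# made-up votes from three verifiers.
entities = ["car", "traffic light", "dog"]
votes = {
    "car": [True, True, True],
    "traffic light": [True, True, False],
    "dog": [False, False, True],
}
kept, dropped = cross_check(entities, votes)
missed = find_missed_objects(
    kept,
    detected_objects=["car", "pedestrian", "tree", "pedestrian"],
    critical_classes={"car", "pedestrian", "traffic light"},
)
print(kept, dropped, missed)
```

In this toy run, "dog" is dropped as a likely hallucination because only one verifier confirms it, and "pedestrian" is flagged as a critical object to add to the description.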

Paper Structure

This paper contains 30 sections, 10 figures, 10 tables, 2 algorithms.

Figures (10)

  • Figure 1: Examples of hallucinations in the traffic scenario, with hallucinatory descriptions highlighted in red, critical objects ignored in the description highlighted in pink, and wrong driving decisions highlighted in blue.
  • Figure 2: (a) Process of the image-text annotation of HCOENet. (b) The results of HCOENet for eliminating hallucinations on the POPE benchmark.
  • Figure 3: The overall structure of the proposed HCOENet method. Here, the LVLM can be LLaVA-1.5, mPLUG-Owl2, MiniGPT-4, Mini-InternVL-4B, etc. The hallucinatory texts are highlighted in red, and newly generated semantic contents are highlighted in blue.
  • Figure 4: Few-shot prompting design for the key entity extraction stage and hallucination correction stage.
  • Figure 5: Hallucination evaluation process of the POPE benchmark.
  • ...and 5 more figures
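The F1-scores reported on the POPE benchmark come from yes/no questions about whether an object is present in an image. A minimal sketch of that scoring follows, assuming answers are already normalized to the strings "yes" and "no" and that "yes" is the positive class. The function name and input format are illustrative and are not taken from the paper's code.

```python
def pope_scores(predictions, labels):
    """Compute precision, recall, and F1 for yes/no answers.

    Assumes "yes" is the positive class and that each prediction has
    already been reduced to "yes" or "no".
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    pairs = [
        (pred.strip().lower() == "yes", gold.strip().lower() == "yes")
        for pred, gold in zip(predictions, labels)
    ]
    tp = sum(1 for pred, gold in pairs if pred and gold)
    fp = sum(1 for pred, gold in pairs if pred and not gold)
    fn = sum(1 for pred, gold in pairs if not pred and gold)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: 2 true positives, 1 false positive, 1 false negative.
scores = pope_scores(
    ["yes", "yes", "no", "no", "yes"],
    ["yes", "no", "no", "yes", "yes"],
)
print(scores)
```

A hallucinated object shows up here as a false positive (the model answers "yes" for an absent object), so removing hallucinations mainly raises precision, while recall depends on the objects the model correctly confirms.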