VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models

Dexter Neo, Tsuhan Chen

TL;DR

This work addresses the persistent object hallucination problem in Large Vision-Language Models (LVLMs) by introducing Visual Ordinal Calibration (VORD), which enforces ordinal relationships between token confidences derived from original and modified images. VORD comes in two practical forms: a training-free VORD Decoding method and a trainable VORD Loss with an adaptive visual-similarity margin. Both are designed to suppress unlikely tokens while preserving informative ones. Across POPE, the MME Hallucination Subset, and LLaVA-Bench, VORD improves calibration (lower expected calibration error, ECE) and reduces hallucinations, with consistent gains over strong baselines. The work demonstrates that ordinal calibration can meaningfully improve the reliability and faithfulness of LVLM outputs, with potential extensions to NLP and broader grounding tasks.

Abstract

Large Vision-Language Models (LVLMs) have made remarkable developments along with the recent surge of large language models. Despite their advancements, LVLMs have a tendency to generate plausible yet inaccurate or inconsistent information based on the provided source content. This phenomenon, also known as "hallucinations", can have serious downstream implications during the deployment of LVLMs. To address this, we present VORD, a simple and effective method that alleviates hallucinations by calibrating token predictions based on ordinal relationships between modified image pairs. VORD is presented in two forms: (1) a minimalist training-free variant which eliminates implausible tokens from modified image pairs, and (2) a trainable objective function that penalizes unlikely tokens. Our experiments demonstrate that VORD delivers better calibration and effectively mitigates object hallucinations on a wide range of LVLM benchmarks.
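
The training-free variant described above can be illustrated with a minimal sketch. It assumes that a token is "implausible" when its probability given the modified image exceeds its probability given the original image; the exact rule used in the paper (for example, any margin term) is not stated in this summary. The function name `ordinal_filter` and the toy logits are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ordinal_filter(logits_orig, logits_mod):
    """Zero out tokens that violate the assumed ordinal relation, then renormalize.

    Assumption (not confirmed by the source): a token is kept only if its
    probability given the original image is at least its probability given
    the modified image.
    """
    p_orig = softmax(logits_orig)
    p_mod = softmax(logits_mod)
    kept = [p if p >= q else 0.0 for p, q in zip(p_orig, p_mod)]
    total = sum(kept)
    if total == 0.0:
        # Defensive fallback: leave the original distribution unchanged.
        return p_orig
    return [k / total for k in kept]


# Toy vocabulary of three tokens; the second token becomes more likely
# when the image is modified, so it is treated as implausible here.
filtered = ordinal_filter([2.0, 1.0, 0.0], [1.0, 2.0, -1.0])
print(filtered)
```

In this toy case the second token is removed and the remaining probability mass is redistributed over the other two tokens. In practice the two logit vectors would come from two forward passes of the same LVLM, one per image.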

Paper Structure

This paper contains 42 sections, 10 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: VORD suppresses hallucinated objects such as <person> by enforcing ordinal ranking of confidences and penalizing unlikely tokens during the generation process.
  • Figure 2: Comparison of token probabilities obtained from the original and modified images. We observe a non-ordinal relationship between the two probability distributions. The prompt used is: "Describe this image in detail."
  • Figure 3: Visual corruptions, such as random noise and image mixing, can introduce uncertainty into LVLMs. Our findings indicate that Mixup can be a particularly effective technique for inducing uncertainty, leading to more significant errors than diffusion noise.
  • Figure 4: VORD penalizes tokens whose conditional probabilities given the modified image $\hat{v}$ are higher than those given the original image $v$. Enforcing the transitive property in VORD helps improve model performance by filtering unlikely tokens during generation and training.
  • Figure 5: Our experiments demonstrate that finetuning LLaVA with VORD loss, particularly the squared variant $(\psi=2)$, highlighted in magenta, leads to consistent performance gains over the baseline. Moreover, combining VORD loss with VORD decoding (shaded) results in additional improvements.
  • ...and 2 more figures
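
The penalty described in the captions of Figures 4 and 5 can be sketched as a hinge-style term raised to a power $\psi$. This is an illustrative guess at the general shape, not the paper's stated loss: the function `ordinal_penalty`, the way the margin enters, and the averaging over tokens are all assumptions. The summary mentions an adaptive visual-similarity margin, but how it is computed is not given here, so `margin` is left as a plain parameter.

```python
def ordinal_penalty(p_orig, p_mod, margin=0.0, psi=2):
    """Hinge-style penalty on ordinal violations, averaged over tokens.

    Assumed form (hypothetical): max(0, p_mod - p_orig + margin) ** psi.
    psi=2 corresponds to the squared variant mentioned for Figure 5.
    """
    terms = [max(0.0, q - p + margin) ** psi for p, q in zip(p_orig, p_mod)]
    return sum(terms) / len(terms)


# Only the second token is more probable under the modified image,
# so it is the only one that contributes to the penalty.
p_v = [0.7, 0.2, 0.1]
p_v_hat = [0.3, 0.6, 0.1]
print(ordinal_penalty(p_v, p_v_hat))          # squared variant
print(ordinal_penalty(p_v, p_v_hat, psi=1))   # linear variant
```

Tokens that already satisfy the ordinal relation contribute zero, so the penalty only pushes down tokens that become more confident when the image is modified.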