VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models
Dexter Neo, Tsuhan Chen
TL;DR
This work addresses the persistent object hallucination problem in Large Vision-Language Models (LVLMs) by introducing Visual Ordinal Calibration (VORD), which enforces ordinal relationships between token confidences derived from original and modified images. It offers two practical forms: a training-free VORD Decoding method and a trainable VORD Loss with an adaptive visual-similarity margin, both designed to suppress unlikely tokens while preserving informative ones. Across POPE, the MME Hallucination Subset, and LLaVA-Bench, VORD improves calibration (lower Expected Calibration Error, ECE) and reduces hallucinations, with consistent gains over strong baselines. The work demonstrates that ordinal calibration can meaningfully enhance the reliability and faithfulness of LVLM outputs, with potential extensions to NLP and broader grounding tasks.
Abstract
Large Vision-Language Models (LVLMs) have advanced remarkably alongside the recent surge of large language models. Despite these advances, LVLMs tend to generate plausible yet inaccurate or inconsistent information with respect to the provided source content. This phenomenon, known as "hallucination", can have serious downstream implications when LVLMs are deployed. To address this, we present VORD, a simple and effective method that alleviates hallucinations by calibrating token predictions based on ordinal relationships between modified image pairs. VORD is presented in two forms: (1) a minimalist training-free variant that eliminates implausible tokens from modified image pairs, and (2) a trainable objective function that penalizes unlikely tokens. Our experiments demonstrate that VORD delivers better calibration and effectively mitigates object hallucinations on a wide range of LVLM benchmarks.
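To make the training-free variant more concrete, the following is a minimal sketch of how an ordinal filter over next-token confidences might look. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the direction of the ordinal relationship, so the sketch assumes that a plausible token should be at least as confident given the original image as given the modified one. The function name ordinal_filter, the fixed margin parameter, the fallback behaviour, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ordinal_filter(logits_original, logits_modified, margin=0.0):
    """Hypothetical ordinal filter over next-token probabilities.

    ASSUMPTION (not stated in the abstract): a token is kept only if
    p(token | original image) >= p(token | modified image) + margin.
    Tokens that violate this ordering are zeroed out and the remaining
    probabilities are renormalized.
    """
    p_orig = softmax(logits_original)
    p_mod = softmax(logits_modified)
    keep = [po >= pm + margin for po, pm in zip(p_orig, p_mod)]
    if not any(keep):
        # Assumed fallback: if every token is rejected, leave the
        # original distribution untouched rather than return all zeros.
        return p_orig
    masked = [po if k else 0.0 for po, k in zip(p_orig, keep)]
    total = sum(masked)
    return [m / total for m in masked]


# Toy vocabulary of three tokens; the logits are made up for illustration.
vocab = ["dog", "frisbee", "grass"]
logits_original = [2.0, 1.0, 0.5]    # conditioned on the original image
logits_modified = [1.0, 2.0, -1.0]   # conditioned on the modified image

filtered = ordinal_filter(logits_original, logits_modified)
for token, prob in zip(vocab, filtered):
    print(f"{token}: {prob:.3f}")
```

In this toy example the second token becomes more confident when the image is modified, so under the assumed ordering it is treated as implausible and removed, while the other two tokens are kept and renormalized. A trainable counterpart would presumably penalize the same violations with a margin-based loss, but its exact form is not given in the abstract.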
