RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in Large Vision Language Models

Sangmin Woo, Jaehyuk Jang, Donguk Kim, Yubin Choi, Changick Kim

TL;DR

RITUAL tackles LVLM hallucinations by decoding with dual inputs: the original image and randomly transformed variants, enabling the model to reconcile conflicting visual cues without additional training. The method ensembles the conditional distributions from both views, effectively reducing hallucinations while maintaining text quality, and is extended by RITUAL+, which uses self-feedback to select transformations adaptively. Empirical results across POPE, CHAIR, MME-Hallucination, and MME-Fullset show RITUAL and RITUAL+ outperform contrastive decoding baselines, sometimes rivaling beam-search methods with lower latency. This training-free, model-agnostic approach offers a practical and robust lever to improve reliability and trustworthiness of large vision-language systems in real-world use.

Abstract

Recent advancements in Large Vision Language Models (LVLMs) have revolutionized how machines understand and generate textual responses based on visual inputs, yet they often produce "hallucinatory" outputs that misinterpret visual information, posing challenges in reliability and trustworthiness. We propose RITUAL, a simple decoding method that reduces hallucinations by leveraging randomly transformed images as complementary inputs during decoding, adjusting the output probability distribution without additional training or external models. Our key insight is that random transformations expose the model to diverse visual perspectives, enabling it to correct misinterpretations that lead to hallucinations. Specifically, when a model hallucinates based on the original image, the transformed images -- altered in aspects such as orientation, scale, or color -- provide alternative viewpoints that help recalibrate the model's predictions. By integrating the probability distributions from both the original and transformed images, RITUAL effectively reduces hallucinations. To further improve reliability and address potential instability from arbitrary transformations, we introduce RITUAL+, an extension that selects image transformations based on self-feedback from the LVLM. Instead of applying transformations randomly, RITUAL+ uses the LVLM to evaluate and choose transformations that are most beneficial for reducing hallucinations in a given context. This self-adaptive approach mitigates the potential negative impact of certain transformations on specific tasks, ensuring more consistent performance across different scenarios. Experiments demonstrate that RITUAL and RITUAL+ significantly reduce hallucinations across several object hallucination benchmarks.
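
The abstract describes integrating the output distributions obtained from the original and the transformed image. A minimal sketch of that idea follows. It assumes the two views are combined as a weighted sum of next-token logits followed by a softmax; that combination rule, the weight `alpha`, the function names, and the toy numbers are all illustrative assumptions rather than details stated in the text above.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_views(
    logits_original: Sequence[float],
    logits_transformed: Sequence[float],
    alpha: float = 1.0,
) -> List[float]:
    """Ensemble next-token logits from the original and transformed image.

    Assumption: the views are mixed as a weighted sum of logits and then
    normalized; `alpha` is a hypothetical balancing weight.
    """
    if len(logits_original) != len(logits_transformed):
        raise ValueError("both views must score the same vocabulary")
    mixed = [a + alpha * b for a, b in zip(logits_original, logits_transformed)]
    return softmax(mixed)


# Toy three-token vocabulary with made-up logits.
vocab = ["dog", "cat", "chair"]
logits_original = [2.0, 2.2, 0.1]     # original view narrowly prefers "cat"
logits_transformed = [2.5, 1.0, 0.2]  # transformed view prefers "dog"

probs = combine_views(logits_original, logits_transformed, alpha=1.0)
best = vocab[max(range(len(vocab)), key=probs.__getitem__)]
print(best, [round(p, 3) for p in probs])
```

In this toy case the original view alone would pick "cat", while the combined distribution picks "dog". Setting `alpha=0.0` recovers decoding from the original image only.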

Paper Structure (41 sections, 9 equations, 13 figures, 18 tables)

Figures (13)

  • Figure 1: RITUAL: A simple yet effective anti-hallucination approach for LVLMs. Our RITUAL method leverages basic image transformations (e.g., vertical and horizontal flips) to enhance LVLM accuracy without external models or training. By integrating transformed and original images, RITUAL significantly reduces hallucinations in both discriminative tasks and descriptive tasks. Using both versions together enables the model to refine predictions, reducing errors and boosting correct responses.
  • Figure 2: Overview of RITUAL and RITUAL+. In RITUAL, the original image $\mathcal{V}$ undergoes random transformations, generating a transformed image $\mathcal{V}^{(T)}$. In RITUAL+, the model evaluates various potential transformations and selects the most beneficial one to improve answer accuracy within the given context, further refining reliability. These transformed images serve as complementary inputs, enabling the model to incorporate multiple visual perspectives to reduce hallucinations.
  • Figure 3: Comparison on MME-Fullset [fu2024mme]. RITUAL significantly enhances the general vision-language capabilities of LVLMs across a wide range of tasks. When equipped with RITUAL, LLaVA-1.5 [liu2023visual] achieves top performance in 12 of the 14 categories, while InstructBLIP [dai2024instructblip] leads in 8 categories and mPLUG-Owl2 [ye2024mplug] ranks highest in 9 categories. Detailed results are in the Appendix.
  • Figure 4: Impact of the number of augmented images in RITUAL.
  • Figure 5: Impact of combining original and transformed images.
  • ...and 8 more figures
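
The captions for Figures 1 and 2 contrast two ways of choosing a transformation: at random (RITUAL) or by model self-feedback (RITUAL+). The sketch below illustrates that contrast on a toy image. The nested-list image type, the two-flip transformation pool, uniform sampling, and the `score` callable standing in for LVLM self-feedback are all hypothetical placeholders, not the paper's implementation.

```python
import random
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def horizontal_flip(img: Image) -> Image:
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in img]


def vertical_flip(img: Image) -> Image:
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(img)]


# Hypothetical pool; Figure 1 names flips as example transformations.
TRANSFORMS: Dict[str, Callable[[Image], Image]] = {
    "horizontal_flip": horizontal_flip,
    "vertical_flip": vertical_flip,
}


def pick_random(rng: random.Random) -> str:
    """RITUAL-style choice: sample a transformation (uniformly, by assumption)."""
    return rng.choice(sorted(TRANSFORMS))


def pick_by_feedback(score: Callable[[str], float]) -> str:
    """RITUAL+-style choice: keep the transformation the scorer rates highest.

    `score` is a hypothetical stand-in for self-feedback from the LVLM.
    """
    return max(sorted(TRANSFORMS), key=score)


image = [[1, 2], [3, 4]]
random_name = pick_random(random.Random(0))
transformed = TRANSFORMS[random_name](image)

toy_scores = {"horizontal_flip": 0.2, "vertical_flip": 0.7}  # made-up ratings
chosen_name = pick_by_feedback(toy_scores.__getitem__)
```

In a real pipeline the scorer would query the model about which transformation suits the current image and question; here a fixed dictionary plays that role so the selection logic can be read in isolation.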