Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution

Weiming Ren, Raghav Goyal, Zhiming Hu, Tristan Ty Aumentado-Armstrong, Iqbal Mohomed, Alex Levinshtein

TL;DR

This work uses multimodal large language models (MLLMs) to assess hallucinatory visual elements in super-resolved images: a constructed prompt yields a "Hallucination Score" (HS), which is found to align closely with human evaluations and to provide insights complementary to prior image metrics used for super-resolution (SR) models.

Abstract

Generative super-resolution (GSR) currently sets the state-of-the-art in terms of perceptual image quality, overcoming the "regression-to-the-mean" blur of prior non-generative models. However, from a human perspective, such models do not fully conform to the optimal balance between quality and fidelity. Instead, a different class of artifacts, in which generated details fail to perceptually match the low resolution image (LRI) or ground-truth image (GTI), is a critical but under-studied issue in GSR, limiting its practical deployment. In this work, we focus on measuring, analyzing, and mitigating these artifacts (i.e., "hallucinations"). We observe that hallucinations are not well-characterized with existing image metrics or quality models, as they are orthogonal to both exact fidelity and no-reference quality. Instead, we take advantage of multimodal large language models (MLLMs) by constructing a prompt that assesses hallucinatory visual elements and generates a "Hallucination Score" (HS). We find that HS is closely aligned with human evaluations, and also provides complementary insights to prior image metrics used for super-resolution (SR) models. Finally, we propose a few efficient HS proxies and demonstrate how diffusion-based GSR models can be fine-tuned to mitigate hallucinations, leveraging HS proxies as differentiable reward functions.

Paper Structure

This paper contains 46 sections, 21 figures, 10 tables.

Figures (21)

  • Figure 1: Hallucination score for image super-resolution. The outputs of state-of-the-art super-resolution (SR) models (e.g., SeeSR wu2024seesr and PASD yang2024pasd) often contain significant hallucinations, as seen in the example images above. For each example set, we show the outputs of two SR models and the preference of a given metric for each output, via a green checkmark in its row; for instance, in the left inset, LPIPS prefers the SeeSR output, while SSIM favours the PASD one. While human evaluators and our proposed hallucination score (HS) can identify hallucinatory outputs, traditional metrics (PSNR, SSIM, MUSIQ, and LPIPS) often fail to do so. Further, notice that the HS does not always align with existing metrics, as it captures complementary aspects of SR quality.
  • Figure 2: Examples of hallucinations. Top: SeeSR outputs wu2024seesr; bottom: zoom-ins of SR (left) with GT (right). From left to right, we see: (i) incorrect semantics, wrongly adding feathers to the stone; (ii) visually jarring scene alterations, despite coarse semantic preservation; and (iii) textual artifacts. Notice the textures appear realistic and sharp, but are perceptually unappealing.
  • Figure 3: Illustration of our hallucination definition. Property P1 defines SRI content as hallucinatory if it cannot be plausibly degraded into LRI content. Property P2 considers a continuum from blurred content (due to uncertainty) and/or innocuous detail changes (less hallucinatory) to perceptually salient and/or semantically severe distortions (highly hallucinatory).
  • Figure 4: Generating hallucination scores with GPT-4o. We construct a prompt comprising three essential parts: task introduction, evaluation criteria, and output format. This detailed prompt is then combined with the input images and fed into the MLLM (GPT-4o hurst2024gpt) to obtain hallucination scores and accompanying explanations. The full prompt can be found in Supp. Fig. \ref{fig:promptfull}.
  • Figure 5: Qualitative examples of our MLLM-based hallucination score. In this figure, we show six example outputs from the MLLM given the LRI (top-left), GTI (top-right), SRI (bottom) and the prompt as inputs. Each output includes a numerical score on a 1-5 scale with detailed explanations justifying the assigned score. The results demonstrate the MLLM's ability to effectively identify critical hallucination issues in each image and assign accurate hallucination scores accordingly.
  • ...and 16 more figures
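
The captions for Figures 4 and 5 describe a three-part prompt (task introduction, evaluation criteria, output format) and a reply consisting of a score on a 1-5 scale plus an explanation. The following is a minimal sketch of that flow, not the authors' implementation: the prompt wording, the `Score:` / `Explanation:` reply layout, and the names `build_prompt`, `parse_reply`, and `HallucinationResult` are all assumptions made for illustration. The actual prompt is in the paper's supplement, and the call to the MLLM itself is deliberately left out.

```python
import re
from dataclasses import dataclass

# Hypothetical placeholder wording; the paper's real prompt is in its supplement.
TASK_INTRO = "You are given a low-resolution image, a ground-truth image, and a super-resolved image."
CRITERIA = "Rate how strongly the super-resolved image hallucinates content, on a 1-5 scale."
OUTPUT_FORMAT = "Reply with 'Score: <1-5>' on one line, then 'Explanation: <text>'."


@dataclass
class HallucinationResult:
    score: int
    explanation: str


def build_prompt(task_intro: str, criteria: str, output_format: str) -> str:
    """Join the three prompt parts named in the Figure 4 caption."""
    return "\n\n".join([task_intro, criteria, output_format])


def parse_reply(reply: str) -> HallucinationResult:
    """Extract the score and explanation from a reply in the assumed layout."""
    score_match = re.search(r"Score:\s*([1-5])\b", reply)
    if score_match is None:
        raise ValueError("no score in the 1-5 range found in reply")
    expl_match = re.search(r"Explanation:\s*(.+)", reply, flags=re.DOTALL)
    explanation = expl_match.group(1).strip() if expl_match else ""
    return HallucinationResult(int(score_match.group(1)), explanation)


if __name__ == "__main__":
    prompt = build_prompt(TASK_INTRO, CRITERIA, OUTPUT_FORMAT)
    # A made-up reply, standing in for what an MLLM might return.
    fake_reply = "Score: 4\nExplanation: Feathers were added to the stone."
    print(parse_reply(fake_reply))
```

Rejecting replies without a valid in-range score, rather than defaulting to a value, keeps malformed model output from silently skewing aggregate scores. Which end of the scale denotes more severe hallucination is not stated in the captions, so the sketch does not interpret the number.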