ReFIR: Grounding Large Restoration Models with Retrieval Augmentation

Hang Guo, Tao Dai, Zhihao Ouyang, Taolin Zhang, Yaohua Zha, Bin Chen, Shu-tao Xia

TL;DR

This paper tackles hallucination in diffusion-based large restoration models (LRMs) by introducing ReFIR, a training-free Retrieval-Augmented framework that leverages retrieved high-quality reference images. It couples a nearest-neighbor reference retriever with a cross-image injection mechanism that uses separate attention, spatial adaptive gating, and distribution alignment to transfer textures from references into the detail-texture restoration stage of LRMs. The approach is model-agnostic and shows consistent improvements in both fidelity and perceptual quality across datasets without retraining, validating the effectiveness of external knowledge grounding in image restoration. The practical impact is significant: ReFIR can be plugged into existing LRMs to reduce hallucinations while maintaining efficiency and generality, enabling higher-fidelity restoration in challenging real-world degradations.

Abstract

Recent advances in diffusion-based Large Restoration Models (LRMs) have significantly improved photo-realistic image restoration by leveraging the internal knowledge embedded within model weights. However, existing LRMs often suffer from the hallucination dilemma, i.e., producing incorrect contents or textures when dealing with severe degradations, due to their heavy reliance on limited internal knowledge. In this paper, we propose an orthogonal solution called the Retrieval-augmented Framework for Image Restoration (ReFIR), which incorporates retrieved images as external knowledge to extend the knowledge boundary of existing LRMs so that they generate details faithful to the original scene. Specifically, we first use nearest-neighbor lookup to retrieve content-relevant high-quality images as references, and then propose cross-image injection, which modifies existing LRMs to utilize high-quality textures from the retrieved images. Thanks to the additional external knowledge, ReFIR handles the hallucination challenge well and produces faithful results. Extensive experiments demonstrate that ReFIR achieves restoration results that are both high-fidelity and realistic. Importantly, ReFIR requires no training and is adaptable to various LRMs.
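The nearest-neighbor lookup mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' retriever: the abstract does not say which feature extractor or similarity measure is used, so the sketch assumes that every image has already been mapped to a fixed-length feature vector and that cosine similarity ranks the candidates. The names `l2_normalize` and `retrieve_references` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_references(query_feat, db_feats, k=1):
    """Return indices and cosine similarities of the k nearest database entries.

    query_feat: (C,) feature of the degraded input (assumed precomputed).
    db_feats:   (N, C) features of the high-quality database (assumed precomputed).
    """
    q = l2_normalize(np.asarray(query_feat, dtype=np.float64))
    d = l2_normalize(np.asarray(db_feats, dtype=np.float64))
    sims = d @ q                   # cosine similarity to every database entry
    top = np.argsort(-sims)[:k]    # indices sorted by descending similarity
    return top, sims[top]


# Toy database with three feature vectors and one query.
db = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])
idx, sims = retrieve_references(query, db, k=2)
print(idx, sims)
```

In this toy case the query is closest to the first database entry, followed by the third. A database of realistic size would likely replace the brute-force matrix product with an approximate nearest-neighbor index.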

Paper Structure (32 sections, 4 equations, 19 figures, 10 tables)

Figures (19)

  • Figure 1: Existing LRMs encounter hallucination issues, i.e., they generate contents or details that deviate from the original scene, when dealing with challenging degradations. Incorporating the proposed ReFIR into an existing LRM [yu2024supir] without any training supplies additional external knowledge, which helps produce more faithful results. Please zoom in for better visualization.
  • Figure 2: In-depth visualization of the working mechanism of LRMs. Left: we use PCA to visualize the top three principal components of the latents extracted from the self-attention layer of the ControlNet and the UNet decoder. Right: quantitative power spectrum of the corresponding latents, obtained by Fourier analysis. More visualizations can be found in \ref{sec:supp-additional-viz-res}.
  • Figure 3: Our ReFIR consists of two stages: the Reference Image Retrieval stage employs the retriever $\mathcal{R}$ to search for content-relevant images in the high-quality image database $\mathcal{D}$, and the High-fidelity Image Restoration stage then restores the HQ image with the reference images $\mathbf{I_R}$ as the condition. The proposed framework is highly generic and can be applied to multiple existing LRMs without any training or fine-tuning.
  • Figure 4: An illustration of cross-image injection. Both $\mathcal{C}_T$ and $\mathcal{C}_S$ share the same model weights.
  • Figure 4: Effectiveness of different components in cross-image injection.
  • ...and 14 more figures
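
The two analyses named in the Figure 2 caption, a PCA projection onto the top three principal components and a Fourier power spectrum, can be sketched as follows. This is a generic illustration on random arrays, not the authors' analysis code: the latent layout `(H, W, C)`, the radial averaging of the spectrum, and the function names `pca_top3` and `radial_power_spectrum` are assumptions.

```python
import numpy as np


def pca_top3(latent):
    """Project an (H, W, C) latent onto its top three principal components.

    Returns an (H, W, 3) array rescaled to [0, 1] per component, so it can be
    displayed as an RGB image.
    """
    h, w, c = latent.shape
    x = latent.reshape(-1, c)
    x = x - x.mean(axis=0, keepdims=True)             # center each channel
    _, _, vt = np.linalg.svd(x, full_matrices=False)  # rows of vt are principal axes
    proj = x @ vt[:3].T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return ((proj - lo) / np.maximum(hi - lo, 1e-12)).reshape(h, w, 3)


def radial_power_spectrum(feat):
    """Radially averaged power spectrum of a 2D map of shape (H, W).

    Entry r of the result is the mean power over frequencies whose integer
    distance from the zero frequency equals r.
    """
    h, w = feat.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feat))     # zero frequency moved to the center
    power = np.abs(spectrum) ** 2
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    total = np.bincount(r.ravel(), weights=power.ravel())
    count = np.bincount(r.ravel())
    return total / np.maximum(count, 1)


rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8, 16))    # stand-in for a real self-attention latent
rgb = pca_top3(latent)
flat_spec = radial_power_spectrum(np.ones((8, 8)))
```

A constant map places all of its power at radius 0, while a latent rich in fine texture would show more energy at larger radii.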