ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation

Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried

TL;DR

Diffusion-based text-to-image models struggle with rare or unseen concepts. ImageRAG dynamically retrieves reference images to guide generation, requires no RAG-specific training, and works with multiple base models and control types. A vision-language model identifies concepts missing from an initial generation and writes a retrieval caption for each; matching images are then retrieved from a large dataset via CLIP-based similarity and supplied to the model as references alongside the prompt. Across OmniGen and SDXL, ImageRAG improves rare-concept generation and receives favorable qualitative feedback, showing practical, model-agnostic benefits for reference-guided image synthesis.

Abstract

Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the use of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at: https://rotem-shalev.github.io/ImageRAG
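The retrieval step can be illustrated with a minimal sketch of similarity-based lookup. This is not the authors' implementation: it assumes that caption and image embeddings have already been computed by some encoder (the TL;DR mentions CLIP-based similarity), and it uses hand-written 3-dimensional toy vectors in place of real embeddings. The names `cosine_similarity`, `retrieve_top_k`, `toy_index`, and `toy_query` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    image_embeddings: Dict[str, Sequence[float]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k images whose embeddings are most similar to the query."""
    scored = [
        (name, cosine_similarity(query, emb))
        for name, emb in image_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real text/image embeddings (hypothetical values).
toy_index = {
    "golden_retriever.jpg": [0.9, 0.1, 0.0],
    "airliner.jpg": [0.0, 0.2, 0.9],
    "tabby_cat.jpg": [0.7, 0.6, 0.1],
}
toy_query = [0.1, 0.1, 1.0]  # pretend embedding of a caption about an aircraft

top_matches = retrieve_top_k(toy_query, toy_index, k=2)
for name, score in top_matches:
    print(f"{name}: {score:.3f}")
```

In a real setting the dictionary would be replaced by an index over a large image dataset, and the query vector would come from encoding a retrieval caption with the same model that embedded the images.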

Paper Structure

This paper contains 22 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Using references broadens the generation capabilities of image generation models. Given a text prompt, ImageRAG dynamically retrieves relevant images and provides them to a base text-to-image model (T2I). ImageRAG works with different models, such as SDXL (A) or OmniGen (B, C), and different controls, e.g. text (A, B) or personalization (C).
  • Figure 2: Hallucinations. When models do not know the meaning of a prompt, they may "hallucinate" and generate unrelated images (left). By applying our method to retrieve and utilize relevant references (mid), the base models can generate appropriate images (right).
  • Figure 3: Top: a high-level overview of our method. Given a text prompt $\mathord{<}p\mathord{>}$, we generate an initial image using a text-to-image (T2I) model. Then, we generate retrieval-captions $\mathord{<}c_j\mathord{>}$, retrieve an image $\mathord{<}i_j\mathord{>}$ from an external database for each caption, and provide these images to the model as references for better generation. Bottom: the retrieval-caption generation block. We use a VLM to decide if the initial image matches the given prompt. If not, we ask it to list the missing concepts and to create, for each missing concept, a caption that could be used to retrieve appropriate examples.
  • Figure 4: Personalized generation example. ImageRAG can work in parallel with personalization methods and enhance their capabilities. For example, although OmniGen can generate images of a subject based on an image, it struggles to generate some concepts. Using references retrieved by our method, it can generate the required result.
  • Figure 5: Retrieval dataset size vs. CLIP score on ImageNet (left) and Aircraft (right). Dashed lines represent the scores of the base models. Even relatively small, unspecialized retrieval sets can already improve results. More data leads to further increased scores. However, small sets may not contain relevant retrieval examples, and their use may harm results, particularly for stronger models.
  • ...and 6 more figures
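
The control flow described in the Figure 3 caption can be sketched as follows. This is a hedged outline under stated assumptions, not the authors' code: the T2I model, the VLM, and the retriever are passed in as plain callables, and every name here (`VLMVerdict`, `generate_with_retrieval`, and the stub functions) is hypothetical. Images are represented as strings so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = str  # placeholder type; a real pipeline would use an image object


@dataclass
class VLMVerdict:
    """Assumed shape of the VLM's answer about the initial image."""
    matches_prompt: bool
    retrieval_captions: List[str]  # one caption per missing concept


def generate_with_retrieval(
    prompt: str,
    generate: Callable[[str, List[Image]], Image],
    judge: Callable[[str, Image], VLMVerdict],
    retrieve: Callable[[str], Image],
) -> Image:
    # Step 1: generate an initial image from the text prompt alone.
    initial = generate(prompt, [])
    # Step 2: ask the VLM whether the image matches the prompt.
    verdict = judge(prompt, initial)
    if verdict.matches_prompt:
        return initial
    # Step 3: retrieve one reference image per retrieval caption.
    references = [retrieve(caption) for caption in verdict.retrieval_captions]
    # Step 4: regenerate with the retrieved references as extra context.
    return generate(prompt, references)


# Stub components for demonstration only (all hypothetical).
def stub_generate(prompt: str, refs: List[Image]) -> Image:
    return f"image({prompt}|refs={','.join(refs)})"


def stub_judge(prompt: str, image: Image) -> VLMVerdict:
    # Pretend the VLM found one missing concept in the initial image.
    return VLMVerdict(matches_prompt=False,
                      retrieval_captions=["a photo of a lyrebird"])


def stub_retrieve(caption: str) -> Image:
    return caption.replace(" ", "_") + ".jpg"


result = generate_with_retrieval(
    "a lyrebird on a branch", stub_generate, stub_judge, stub_retrieve
)
print(result)
```

Passing the three components as callables reflects the claim that the method is not tied to one base model: swapping the generator changes nothing in the surrounding control flow.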