Table of Contents
Fetching ...

DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, Enrico Magli

TL;DR

This paper introduces DreamCache, a scalable approach for efficient and high-quality personalized image generation that achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.

Abstract

Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.

DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

TL;DR

This paper introduces DreamCache, a scalable approach for efficient and high-quality personalized image generation that achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.

Abstract

Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.

Paper Structure

This paper contains 30 sections, 5 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: DreamCache is a finetuning-free personalized image generation method that achieves an optimal balance between subject fidelity, memory efficiency, and adherence to text prompts.
  • Figure 2: Personalized generations by DreamCache. The first column contains reference images. The generated images correspond to the text prompts above each column.
  • Figure 3: Overview of DreamCache. Original U-Net layers are shown in violet, while the novel components introduced by DreamCache are highlighted in green. During personalization, features from selected layers of the diffusion denoiser are cached from a single timestep, using a null text prompt. These cached features serve as reference-specific information. During generation, conditioning adapters inject the cached features into the denoiser, modulating the features of the generated image to create a personalized output.
  • Figure 4: Visual comparison. Personalized generations on sample concepts. DreamCache preserves reference concept appearance and does not suffer from background interference. BLIP-D li2023blip and Kosmos-G pan2023kosmos cannot faithfully preserve visual details from the reference.
  • Figure 5: Visualization of reference image impact. Cross-attention maps between cached reference features and features of the image under generation. Left: attention map at layers at $16\times 16$ resolution (left reference, right generated). Right:$32 \times 32$. Attention values are highly localized in the region of interest.
  • ...and 5 more figures