RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance

Avideep Mukherjee, Soumya Banerjee, Piyush Rai, Vinay P. Namboodiri

TL;DR

RISSOLE tackles the challenge of deploying diffusion models on resource-limited devices by combining block-wise latent diffusion with retrieval-guided conditioning to enforce cross-block coherence. Each latent block $z^i$ is conditioned on corresponding blocks retrieved from an external database, using retrieval-augmented generation to guide training and sampling. The approach builds on a VQ-GAN latent space and a block-wise DDPM pipeline, enabling parallel training/sampling and reducing parameter count while achieving strong generation quality. Experiments on CelebA64 and ImageNet100 show improved FID and coherent samples at comparable or smaller model sizes than state-of-the-art compact diffusion methods, highlighting practical benefits for low-resource deployment.

Abstract

Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes that make them unsuitable for deployment on resource-constrained devices. Block-wise generation can be a promising alternative for designing compact (parameter-efficient) deep generative models, since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate that the proposed approach addresses the coherence problem, reporting substantive experiments that demonstrate its effectiveness in terms of both compact model size and excellent generation quality.

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Training and Sampling of RISSOLE: (a) During training, each image $x$ goes through a VQ-GAN encoder $E_\theta$, producing a latent representation $z$. (b) This latent representation $z$ is then used by the retriever $\xi_k(\cdot)$ to fetch its $k$ nearest neighbors $\mathcal{M}_{\mathcal{D}}^{(k)}$ from a dataset $\mathcal{D}$. (c) Each group of blocks $\mathcal{M}_{\mathcal{D}_i}^{(k)} \in \mathcal{M}_{\mathcal{D}}^{(k)}$ is used as a conditioning signal for the corresponding $z^i$, aiding the training procedure. (d) During the sampling (generation) phase, a pseudo-query $\hat{z}^0$ obtained from the dataset $\mathcal{D}$ is used by $\xi_k(\cdot)$ to retrieve $\mathcal{M}_{\mathcal{D}}^{(k)}$. (e) From the retrieved set $\mathcal{M}_{\mathcal{D}}^{(k)}$, each subset $\mathcal{M}_{\mathcal{D}_i}^{(k)}$ is used as a conditioning signal for the random noise $z^i_t$ at steps $t=T,T-1,\ldots,3,2,1$ to generate the final denoised block $z_0^i$ of $\hat{z}^0$. (f) All denoised representations $z_0^i$ are reshaped to construct $z_0$, which is then passed through the decoder $D_\phi$ to yield the reconstructed sample $x'$.
  • Figure 2: Each of the three rows above shows a pseudo-query image $\hat{x}$ (used at generation time) from $\mathcal{D}$, its retrieved neighbors, and the sample generated when conditioning on these neighbors. Note that the actual training and sampling occur in the latent space. These images, decoded from the latent representations, are shown for better understanding and visualization.
  • Figure 3: Original images (top row), and samples generated by the RDM baseline (middle row) and by RISSOLE (bottom row), trained on the CelebA and ImageNet100 datasets.
  • Figure 4: Qualitative Samples from RISSOLE models where the input is conditioned with (top) and without (bottom) the positional information.
  • Figure 5: Samples from RISSOLE with (top) and without (bottom) using the previous block as a condition.
  • ...and 1 more figure
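
The data flow described in the Figure 1 caption (split a latent into blocks, retrieve $k$ nearest neighbors, and pair each block $z^i$ with the corresponding blocks of those neighbors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the retriever is assumed to be a brute-force Euclidean nearest-neighbor search over flattened latents, the latents are random arrays standing in for VQ-GAN encodings, and the conditioned denoiser itself is omitted.

```python
import numpy as np


def split_into_blocks(z, grid):
    """Split a latent of shape (C, H, W) into grid*grid non-overlapping
    blocks, returned in row-major order as (grid*grid, C, H//grid, W//grid)."""
    c, h, w = z.shape
    bh, bw = h // grid, w // grid
    blocks = z.reshape(c, grid, bh, grid, bw)    # (C, row, bh, col, bw)
    blocks = blocks.transpose(1, 3, 0, 2, 4)     # (row, col, C, bh, bw)
    return blocks.reshape(grid * grid, c, bh, bw)


def merge_blocks(blocks, grid):
    """Inverse of split_into_blocks: reassemble blocks into (C, H, W)."""
    _, c, bh, bw = blocks.shape
    z = blocks.reshape(grid, grid, c, bh, bw)    # (row, col, C, bh, bw)
    z = z.transpose(2, 0, 3, 1, 4)               # (C, row, bh, col, bw)
    return z.reshape(c, grid * bh, grid * bw)


def retrieve_neighbors(query, database, k):
    """Indices of the k database latents closest to the query.
    Assumption: plain Euclidean distance on flattened latents."""
    flat_db = database.reshape(len(database), -1)
    dists = np.linalg.norm(flat_db - query.reshape(1, -1), axis=1)
    return np.argsort(dists)[:k]


def conditioning_blocks(neighbors, grid):
    """For neighbors of shape (k, C, H, W), return an array of shape
    (grid*grid, k, C, bh, bw): entry i holds the i-th block of every
    neighbor, i.e. the conditioning set for block i of the query."""
    per_neighbor = np.stack([split_into_blocks(n, grid) for n in neighbors])
    return per_neighbor.transpose(1, 0, 2, 3, 4)


# Toy example: 50 random "latents" of shape (4, 8, 8), split on a 2x2 grid.
rng = np.random.default_rng(0)
database = rng.normal(size=(50, 4, 8, 8))
query = database[7] + 0.01 * rng.normal(size=(4, 8, 8))

idx = retrieve_neighbors(query, database, k=3)
cond = conditioning_blocks(database[idx], grid=2)
query_blocks = split_into_blocks(query, grid=2)

print(query_blocks.shape)  # (4, 4, 4, 4): four blocks of shape (4, 4, 4)
print(cond.shape)          # (4, 3, 4, 4, 4): three neighbor blocks per query block
```

In a full pipeline, each `query_blocks[i]` would be noised and denoised by the block-wise diffusion model with `cond[i]` as the conditioning signal, and `merge_blocks` would reassemble the denoised blocks into $z_0$ before decoding.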