LatentCRF: Continuous CRF for Efficient Latent Diffusion

Kanchana Ranasinghe, Sadeep Jayasumana, Andreas Veit, Ayan Chakrabarti, Daniel Glasner, Michael S Ryoo, Srikumar Ramalingam, Sanjiv Kumar

TL;DR

LatentCRF presents a continuous CRF layer that operates in the latent space of Latent Diffusion Models and accelerates inference by replacing several U-Net iterations with a lightweight, trainable inference step. The model incorporates unary, pairwise, and higher-order energies, including a Field-of-Experts prior, and uses differentiable mean-field updates with learned conditioning to capture spatial and semantic consistencies. Training combines a latent denoising loss with a latent-space adversarial objective, and the approach can be paired with distillation from the LDM to mimic its later steps, achieving a 33% speedup with negligible loss in image quality and diversity. Compared to prior distillation and compression methods, LatentCRF better preserves diversity while maintaining high visual fidelity, and it requires no modification to the base LDM, making it a practical add-on for accelerating diffusion-based image generation.
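To make the idea of mean-field updates on a continuous CRF concrete, here is a minimal sketch in plain Python. It is not the paper's model: it assumes a toy quadratic energy with only unary and pairwise terms on a 1-D chain of scalar values, E(x) = 0.5 * sum_i (x_i - u_i)^2 + 0.5 * w * sum over neighboring pairs (x_i - x_j)^2, and omits the higher-order Field-of-Experts term and the learned conditioning. The names `mean_field_step`, `run_mean_field`, `w`, and `num_iterations` are hypothetical.

```python
def mean_field_step(x, unary, w):
    """One parallel mean-field update for the toy quadratic energy.

    Each value moves to the minimizer of its local energy given the
    current values of its chain neighbors (assumed energy, not the paper's).
    """
    n = len(x)
    new_x = []
    for i in range(n):
        neighbors = [x[j] for j in (i - 1, i + 1) if 0 <= j < n]
        # Setting dE/dx_i = 0 gives a weighted average of the unary
        # target and the neighboring values.
        new_x.append((unary[i] + w * sum(neighbors)) / (1.0 + w * len(neighbors)))
    return new_x


def run_mean_field(unary, w=1.0, num_iterations=5):
    """Run a fixed number of updates and record the largest change per step."""
    x = list(unary)  # initialize at the unary targets
    deltas = []
    for _ in range(num_iterations):
        new_x = mean_field_step(x, unary, w)
        deltas.append(max(abs(a - b) for a, b in zip(new_x, x)))
        x = new_x
    return x, deltas


smoothed, deltas = run_mean_field([0.0, 1.0, 0.0, 1.0], w=1.0, num_iterations=5)
print(smoothed)
print(deltas)  # the per-step change shrinks as the updates converge
```

Because every update is a fixed sequence of differentiable arithmetic operations, a loop like this can in principle be unrolled and trained end to end, which is the property the summary above refers to as differentiable mean-field updates.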

Abstract

Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images; however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some of the computationally intensive LDM inference iterations with our lightweight LatentCRF, we achieve a superior balance between quality, speed, and diversity. We increase inference efficiency by 33% with no loss in image quality or diversity compared to the full LDM. LatentCRF is an easy add-on that does not require modifying the LDM.
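The arithmetic behind replacing U-Net iterations can be sketched as below. This is a back-of-the-envelope cost model, not the paper's measurement: it assumes every U-Net iteration costs the same, and that the CRF costs a fixed fraction of one U-Net iteration. The step counts and the `crf_cost` value are illustrative placeholders, not the paper's configuration, and all function names are hypothetical.

```python
def relative_latency(total_steps, replaced_steps, crf_cost=0.0):
    """Latency relative to the full sampler.

    crf_cost is the (assumed) cost of the CRF in units of one U-Net step.
    """
    if not 0 <= replaced_steps <= total_steps:
        raise ValueError("replaced_steps must lie in [0, total_steps]")
    return (total_steps - replaced_steps + crf_cost) / total_steps


def time_saved(relative):
    """Fraction of wall-clock time removed."""
    return 1.0 - relative


def throughput_gain(relative):
    """Fractional increase in images generated per unit time."""
    return 1.0 / relative - 1.0


rel = relative_latency(total_steps=20, replaced_steps=5, crf_cost=0.25)
print(rel, time_saved(rel), throughput_gain(rel))
```

Note that a percentage speedup can be read either as time saved or as throughput gained, and the two differ: removing 25% of the time corresponds to a throughput gain of about 33%. This excerpt does not state which reading the 33% figure uses.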

Paper Structure

This paper contains 28 sections, 16 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of LatentCRF: We replace several LDM inference iterations with an application of our CRF in the LDM's latent space. Our LatentCRF modifies latent vectors with pairwise and higher-order interactions to better align with the distribution of natural image latents. The CRF inference cost is negligible compared to that of the LDM's U-Net, leading to significant savings in inference time.
  • Figure 2: Diversity of generations: We generate multiple images for the prompt 'A cinematic shot of a baby racoon wearing an intricate italian priest robe.' with varying input noise. We observe that our LatentCRF (bottom) retains the diversity of the LDM teacher (Rombach et al., 2021) (top). In contrast, distillation-based approaches like SDXL-Turbo (Sauer et al., 2023) (middle) tend to suffer from decreased diversity (more details in the diversity table).
  • Figure 3: Qualitative Results: Within each set of two, LatentCRF (left) speeds up LDM (right) by 33% while maintaining image quality.
  • Figure A.1: Reverse Diffusion Variance: For a 50 time-step DDIM reverse diffusion process, at each iteration we calculate the variance of the generated latents (shaped $h' \times w' \times d$) across all of their dimensions. We average these values across 1632 images generated using the Parti Prompts (Yu et al., 2022). Note how the variance drops significantly in the later iterations ($t > 40$), while the early iterations ($t < 20$) in particular exhibit high variance. We intuit that the large-capacity U-Nets within an LDM are well suited for guiding latent modifications in this high-variance regime. On the other hand, more lightweight modules such as our LatentCRF are capable of operating in the later, low-variance iterations, leading to significant inference speed-ups.
  • Figure A.2: CRF Convergence: We visualize images generated by LatentCRF for varying values of the num_iterations parameter. Rows from top to bottom show generated images for values of 1, 2, 3, 4, 5, and 10, respectively. On careful inspection, slight changes are visible in the early iterations (e.g., going from iteration 1 to 2 shows color and brightness changes in the background). However, the later iterations show no visual changes at all (e.g., beyond step 4), indicating that our CRF inference has converged. All images are generated using the common prompt "A photograph of a dog in a field."
  • ...and 6 more figures
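
The per-iteration variance measurement described for Figure A.1 can be sketched as follows. This is a minimal illustration under stated assumptions: latents are represented as nested Python lists of shape h' x w' x d, the population variance is used (the caption does not say whether population or sample variance is meant), and the tiny hand-written trajectories stand in for real sampler outputs. The names `flatten`, `variance_per_step`, and `toy_trajectories` are hypothetical.

```python
from statistics import fmean, pvariance


def flatten(latent):
    """Flatten an h' x w' x d nested list into a flat list of floats."""
    return [value for row in latent for cell in row for value in cell]


def variance_per_step(trajectories):
    """Average latent variance at each reverse-diffusion step.

    trajectories[i][t] is the latent of image i at step t. For each step,
    the variance is taken over all latent dimensions of one image and
    then averaged over images.
    """
    num_steps = len(trajectories[0])
    return [
        fmean([pvariance(flatten(traj[t])) for traj in trajectories])
        for t in range(num_steps)
    ]


# Two toy "images", two steps each, latent shape 1 x 2 x 2 (placeholder data).
toy_trajectories = [
    [[[[0.0, 2.0], [0.0, 2.0]]], [[[1.0, 1.0], [1.0, 1.0]]]],
    [[[[0.0, 4.0], [0.0, 4.0]]], [[[0.0, 2.0], [0.0, 2.0]]]],
]
print(variance_per_step(toy_trajectories))
```

With real data, `trajectories` would hold one list of latents per generated image, one latent per DDIM step, and the resulting curve could be plotted against the step index to reproduce the kind of trend the caption describes.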