Controlling Latent Diffusion Using Latent CLIP

Jason Becker, Chris Wendler, Peter Baylies, Robert West, Christian Wressnegger

TL;DR

Latent-CLIP computes CLIP embeddings directly in the latent space of latent diffusion models, removing the need to decode to pixel space and saving computation. It is trained on 2.7B latent image–text pairs and matches the zero-shot ImageNet performance of similarly sized pixel-space CLIP models. Rewards built from Latent-CLIP match their pixel-space CLIP counterparts in reward-based noise optimization (ReNO) while cutting the cost of the total pipeline by about 21%. In safety settings, Latent-CLIP-based rewards mitigate inappropriate content and cultural biases without intermediate decoding. Overall, the approach offers a practical path to efficient, latent-domain semantic alignment and control for large-scale generative models.

Abstract

Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in the latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational cost. However, while the diffusion process has moved to the latent space, the contrastive language-image pre-training (CLIP) models used in many image processing tasks still operate in pixel space. Using them therefore requires costly VAE-decoding of latent images before they can be processed. In this paper, we introduce Latent-CLIP, a CLIP model that operates directly in the latent space. We train Latent-CLIP on 2.7B pairs of latent images and descriptive texts, and show that it matches the zero-shot classification performance of similarly sized CLIP models on both the ImageNet benchmark and an LDM-generated version of it, demonstrating its effectiveness in assessing both real and generated content. Furthermore, we construct Latent-CLIP rewards for reward-based noise optimization (ReNO) and show that they match the performance of their CLIP counterparts on GenEval and T2I-CompBench while cutting the cost of the total pipeline by 21%. Finally, we use Latent-CLIP to guide generation away from harmful content, achieving strong performance on the inappropriate image prompts (I2P) benchmark and a custom evaluation, without ever requiring the costly step of decoding intermediate images.
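
The core idea above, scoring latent images against text without VAE-decoding them, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `ToyLatentEncoder` is a hypothetical random linear projection, not the Latent-CLIP architecture; the 4x8x8 latent shape and the embedding size are arbitrary; and the "text" embeddings are simulated by encoding class prototype latents, since no real text encoder is involved. Only the similarity computation (cosine similarity between normalized embeddings, as in CLIP-style zero-shot classification) reflects the general technique.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def clip_similarity(image_emb, text_emb):
    """Cosine similarity between every image embedding and every text embedding."""
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T


class ToyLatentEncoder:
    """Hypothetical stand-in for a latent-space image encoder.

    It flattens the latent and applies a fixed random linear map. A real
    model would be a trained network; this only mimics the interface
    "latent in, embedding out" with no VAE decoding step.
    """

    def __init__(self, latent_shape, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.standard_normal((int(np.prod(latent_shape)), embed_dim))

    def __call__(self, latents):
        return latents.reshape(latents.shape[0], -1) @ self.weight


rng = np.random.default_rng(42)
latent_shape = (4, 8, 8)  # illustrative (channels, height, width)
encoder = ToyLatentEncoder(latent_shape, embed_dim=32)

# Three class prototypes in latent space; their embeddings play the role of
# text embeddings for three class prompts (an assumption of this sketch).
prototypes = rng.standard_normal((3, *latent_shape))
class_text_emb = encoder(prototypes)

# Query latents: slightly perturbed prototypes, embedded directly from latents.
queries = prototypes + 0.01 * rng.standard_normal(prototypes.shape)
query_emb = encoder(queries)

# Zero-shot classification: pick the class whose embedding is most similar.
similarity = clip_similarity(query_emb, class_text_emb)
predictions = similarity.argmax(axis=1)

# CLIPScore-style reward for each (latent, matching prompt) pair.
rewards = np.diag(similarity)
```

Because the reward is a function of the latent alone, a reward-guided procedure could in principle evaluate it at intermediate steps without invoking the VAE decoder, which is the source of the savings the abstract describes.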

Paper Structure

This paper contains 19 sections, 2 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: We propose to compute CLIP embeddings for latent images directly using Latent-CLIP (bottom) instead of VAE-decoding them first (top). Latent-CLIP can serve as a drop-in replacement of CLIP, preserving performance while saving computation. We use the technique from https://huggingface.co/blog/TimothyAlexisVass/explaining-the-sdxl-latent-space to compute the preview of the latent image.
  • Figure 2: Comparison of image pairs generated by SDXL-Turbo
  • Figure 3: Comparison of images generated using SDXL with the prompt "{label}" and original ImageNet images. The generated images show high similarity, while ImageNet images exhibit greater variety.
  • Figure 4: Images generated from T2I-CompBench prompts using SDXL-Turbo without reward optimization, compared to outputs optimized with CLIPScore-based rewards from traditional CLIP and latent space models.
  • Figure 5: Images generated from T2I-CompBench prompts using SDXL-Turbo without reward optimization, compared to outputs optimized with PickScore-based rewards from traditional CLIP and latent space models.
  • ...and 5 more figures