Improved Training Technique for Latent Consistency Models

Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas

TL;DR

The paper addresses the challenge of scaling consistency training to latent spaces, a prerequisite for large-scale text-to-image and video generation, and identifies impulsive outliers and unstable temporal-difference (TD) signals as the primary bottlenecks. It introduces a set of robust training techniques (Cauchy loss, diffusion loss at early timesteps, OT matching, an adaptive scaling-$c$ scheduler, and Non-scaling LayerNorm) to stabilize latent consistency training. Empirical results on CelebA-HQ, FFHQ, and LSUN Church show that 1- to 2-step latent consistency models reach Fréchet Inception Distance (FID) scores in the low double digits and surpass the prior latent-CM baseline, though they still lag behind full latent diffusion models on some metrics. The approach substantially reduces sampling cost while maintaining sample quality, and the released code enables broader adoption and further improvements, including potential integration with Consistency Trajectory Models.

Abstract

Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-$c$ scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: https://github.com/quandao10/sLCT/
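
To illustrate why swapping Pseudo-Huber for Cauchy dampens outliers, the sketch below compares how the two losses grow with the residual norm. The parameterizations used here (Pseudo-Huber as $\sqrt{r^2 + c^2} - c$ and Cauchy as $\log(1 + r^2 / (2c^2))$) are common textbook forms and are an assumption; the function names are hypothetical and this is not the authors' released implementation.

```python
import math


def pseudo_huber(residual_norm: float, c: float) -> float:
    """Pseudo-Huber loss: quadratic near zero, roughly linear for large residuals."""
    return math.sqrt(residual_norm ** 2 + c ** 2) - c


def cauchy(residual_norm: float, c: float) -> float:
    """Cauchy loss (assumed form): quadratic near zero, logarithmic for large residuals."""
    return math.log1p(residual_norm ** 2 / (2.0 * c ** 2))


if __name__ == "__main__":
    c = 1.0  # illustrative scale only; the paper schedules c adaptively
    for r in (0.1, 1.0, 10.0, 100.0, 1000.0):
        print(f"r={r:>7}: pseudo_huber={pseudo_huber(r, c):10.4f}  cauchy={cauchy(r, c):8.4f}")
```

Because the Cauchy loss grows only logarithmically in the residual, a single impulsive outlier contributes far less to the total loss (and to its gradient) than it would under Pseudo-Huber, whose growth is asymptotically linear.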

Paper Structure

This paper contains 15 sections, 12 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Box and Whisker Plot: Impulsive noise comparison between pixel and latent spaces. The right column shows the statistics of TD values at 21 discretization steps. Other discretization steps exhibit the same behavior: impulsive outliers are consistently present regardless of the total number of discretization steps. The blue boxes represent the interquartile ranges of the data, while the green and orange dashed lines indicate the inner and outer fences, respectively. Outliers are marked with red dots.
  • Figure 2: Analysis of robust loss: Pseudo-Huber, Cauchy, and Geman-McClure
  • Figure 3: Model convergence under different $c$ schedules. (Left) Our proposed $c$ values. (Middle) FID and (Right) Recall of our proposed $c$ schedule compared with alternative choices.
  • Figure 4: Our qualitative results using 1-NFE at resolution $256 \times 256$
  • Figure 5: iLCT qualitative results using 1-NFE at resolution $256 \times 256$
  • ...and 8 more figures
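
The inner and outer fences mentioned in the Figure 1 caption can be sketched as follows. This is a minimal illustration assuming the standard Tukey convention (inner fences at 1.5 × IQR and outer fences at 3 × IQR beyond the quartiles); the caption does not state the constants used, and the function names are hypothetical.

```python
import statistics


def tukey_fences(values, k):
    """Return (lower, upper) fences placed k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify_outliers(values):
    """Split values into mild outliers (between fences) and extreme outliers (past outer fences)."""
    inner_lo, inner_hi = tukey_fences(values, 1.5)  # assumed inner-fence constant
    outer_lo, outer_hi = tukey_fences(values, 3.0)  # assumed outer-fence constant
    mild = [v for v in values
            if (outer_lo <= v < inner_lo) or (inner_hi < v <= outer_hi)]
    extreme = [v for v in values if v < outer_lo or v > outer_hi]
    return mild, extreme


if __name__ == "__main__":
    samples = list(range(1, 12)) + [20, 100]
    mild, extreme = classify_outliers(samples)
    print("mild outliers:", mild)
    print("extreme outliers:", extreme)
```

Under this convention, values past the outer fences are the kind of impulsive outliers the caption describes for the latent space.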