TLCM: Training-efficient Latent Consistency Model for Image Generation with 2-8 Steps

Qingsong Xie, Zhenyi Liao, Zhijie Deng, Chen Chen, Haonan Lu

TL;DR

This paper proposes a training-efficient Latent Consistency Model (TLCM) that first accelerates latent diffusion models (LDMs) via data-free multistep latent consistency distillation (MLCD), then applies data-free latent consistency distillation to efficiently guarantee the inter-segment consistency in MLCD.

Abstract

Distilling latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face two critical challenges: (1) they hinge on long training using a huge volume of real data, and (2) they routinely lead to quality degradation for generation, especially in text-image alignment. This paper proposes a novel training-efficient Latent Consistency Model (TLCM) to overcome these challenges. Our method first accelerates LDMs via data-free multistep latent consistency distillation (MLCD), and then applies data-free latent consistency distillation to efficiently guarantee the inter-segment consistency in MLCD. Furthermore, we introduce a bag of techniques, e.g., distribution matching, adversarial learning, and preference learning, to enhance TLCM's performance at few-step inference without any real data. TLCM is flexible: the number of sampling steps can be adjusted within the range of 2 to 8 while still producing outputs competitive with full-step approaches. Notably, TLCM is data-free in that it uses synthetic data from the teacher for distillation. With just 70 training hours on an A100 GPU, a 3-step TLCM distilled from SDXL achieves a CLIP Score of 33.68 and an Aesthetic Score of 5.97 on the MSCOCO-2017 5K benchmark, surpassing various accelerated models and even outperforming the teacher model in human preference metrics. We also demonstrate the versatility of TLCMs in applications including image style transfer, controllable generation, and Chinese-to-image generation.
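
The abstract refers to segments and inter-segment consistency without giving the segmentation itself. The sketch below illustrates the general idea behind multistep consistency: the timestep range is split into segments, and each timestep maps to the start of its own segment. The uniform split, the 1000-step training range, and the function names segment_edges and segment_start are assumptions made for illustration only, not details taken from the paper.

```python
from __future__ import annotations

from bisect import bisect_right


def segment_edges(num_train_steps: int, num_segments: int) -> list[int]:
    """Return num_segments + 1 evenly spaced timestep edges from 0 to num_train_steps.

    Uniform spacing is an assumption; the paper may use a different split.
    """
    if num_segments < 1 or num_segments > num_train_steps:
        raise ValueError("num_segments must be between 1 and num_train_steps")
    return [round(i * num_train_steps / num_segments) for i in range(num_segments + 1)]


def segment_start(t: int, edges: list[int]) -> int:
    """Return the lower edge of the segment (lower, upper] that contains timestep t."""
    if not 0 < t <= edges[-1]:
        raise ValueError("t must satisfy 0 < t <= edges[-1]")
    # Index of the first edge that is >= t; the edge just before it is the segment start.
    idx = bisect_right(edges, t - 1)
    return edges[idx - 1]


# Hypothetical setting: 1000 training timesteps split into 4 segments.
edges = segment_edges(1000, 4)
# A 4-step sampler would start at each upper edge in descending order.
sampling_timesteps = edges[:0:-1]
```

Under these assumptions, a model trained to be consistent within each segment maps any timestep to its segment start, and a K-step sampler chains K such jumps from the last edge down to 0. Inter-segment consistency would then mean that predictions also agree across segment boundaries, which is what allows the step count to vary.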

Paper Structure (20 sections, 13 equations, 6 figures, 3 tables, 2 algorithms)

Figures (6)

  • Figure 1: $1024 \times 1024$ samples from TLCM, distilled from SDXL-base-1.0 (Podell et al.) based on LoRA (Hu et al.). From top to bottom, 2, 3, 4 and 8 sampling steps are adopted, respectively. Apart from satisfactory visual quality, TLCM can also yield improved metrics compared to strong baselines.
  • Figure 2: Overview of TLCM training. Data-free multistep latent consistency distillation is first used to accelerate the LDM, yielding an initial TLCM (left part of the overview). Then, improved data-free latent consistency distillation is proposed to enforce the global consistency of TLCM. MPS optimization, distribution matching (DM), and adversarial learning are exploited to improve TLCM's performance in a data-free manner (right part of the overview). Note that we omit the Latent LPIPS model for brevity.
  • Figure 3: Visual comparison between our TLCM and the state-of-the-art methods. Zoom in for more details.
  • Figure 4: TLCM with image style transfer. The styles are presented at the top, and we apply image style transfer on the source image with our TLCM. Two-step sampling can produce highly stylized images with excellent results.
  • Figure 5: TLCM with ControlNet. Our TLCM can be incorporated into the ControlNet pipeline and produce satisfactory results with 2-step sampling.
  • ...and 1 more figures