Bidirectional Consistency Models

Liangchen Li, Jiajun He

TL;DR

BCM tackles two bottlenecks of diffusion models, slow generation and difficult inversion, by learning a single network that traverses the PF ODE in both directions, unifying generation and inversion under one trajectory-based objective. With Bidirectional Consistency Training and a dedicated network parameterization, BCM supports one-step generation and one-step inversion, and enables downstream tasks such as interpolation and inpainting. It reaches competitive generation quality with far fewer function evaluations (NFEs) and offers two sampling strategies, ancestral and zigzag, that exploit bidirectional traversal; inversion further enables applications such as real-to-real image interpolation and blind restoration. The authors note two limitations, imperfect inversion fidelity and diminishing returns as NFEs increase, and point to improved inversion and task-specific fine-tuning as future directions.

Abstract

Diffusion models (DMs) can generate remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, greatly reducing the number of iterations. Yet the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce the Bidirectional Consistency Model (BCM), which learns a single neural network that enables both forward and backward traversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. We can train BCM from scratch or tune it from a pretrained consistency model, which reduces the training cost and increases scalability. We demonstrate that BCM enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. We further showcase BCM's capability in downstream tasks, such as interpolation and inpainting. Our code and weights are available at https://github.com/Mosasaur5526/BCM-iCT-torch.
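
To make the idea of a single map that moves both ways along the PF ODE concrete, here is a minimal sketch on a toy problem where the ODE has a closed-form solution. It is not the authors' network or training code. It assumes one-dimensional Gaussian data with standard deviation `SIGMA_DATA` and a variance-exploding noise schedule $\sigma(t) = t$, under which the PF ODE flow from time $t$ to time $u$ is a simple rescaling. The function `toy_bcm`, the constants `EPS` and `T`, and their values are hypothetical names chosen for illustration.

```python
import numpy as np

SIGMA_DATA = 0.5        # assumed std of the toy Gaussian "data" distribution
EPS, T = 0.002, 80.0    # assumed smallest and largest noise levels (illustrative)


def toy_bcm(x, t, u, sigma_data=SIGMA_DATA):
    """Exact PF ODE flow map from time t to time u for zero-mean Gaussian data.

    This closed form stands in for a trained network f_theta(x_t, t, u).
    It accepts u < t (denoising) and u > t (adding noise / inversion) alike.
    """
    scale = np.sqrt((sigma_data**2 + u**2) / (sigma_data**2 + t**2))
    return scale * x


rng = np.random.default_rng(0)
# Draw from the time-T marginal, which is N(0, SIGMA_DATA^2 + T^2) in this toy.
x_T = rng.normal(scale=np.sqrt(SIGMA_DATA**2 + T**2), size=4)

x_gen = toy_bcm(x_T, T, EPS)       # one-step generation: T -> EPS
x_inv = toy_bcm(x_gen, EPS, T)     # one-step inversion:  EPS -> T
x_mid = toy_bcm(x_T, T, 1.0)       # jump to an intermediate time first
x_two = toy_bcm(x_mid, 1.0, EPS)   # two-step generation: T -> 1.0 -> EPS
```

Because the toy map is the exact flow, inversion recovers `x_T` and the two-step path lands on the same point as the one-step path. A learned model only approximates this behaviour, which is why extra steps can help in practice.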

Paper Structure (35 sections, 30 equations, 30 figures, 2 tables, 7 algorithms)


Figures (30)

  • Figure 1: An illustrative comparison of score-based diffusion models, consistency models, consistency trajectory models, and our proposed bidirectional consistency models. (a) DM estimates the score function at a given time step; (b) CM enforces self-consistency, i.e., different points on the same trajectory map to the same initial point; (c) CTM generalizes this principle by mapping a point at time $t$ back to another point at time $u \leq t$ along the same trajectory; (d) BCM is designed to map any two points on the same trajectory to each other, removing any restriction on the mapping direction. When the mapping direction aligns with the diffusion direction, the model adds noise to an input image. Conversely, when the mapping direction is opposite, the model performs denoising. This approach unifies generation and inversion in a single framework.
  • Figure 1: Sample quality on CIFAR-10 (left) and ImageNet-64 (right). We train BCM from scratch on CIFAR-10 and fine-tune it using our reproduced iCT model on ImageNet-64. $^*$Results estimated from Figure 13 in Kim et al. (2023). $^\dag$For our BCM and BCM-deep, we use ancestral sampling when NFE=2, zigzag sampling when NFE=3, and the combination of both when NFE=4. Our results indicate that ancestral and zigzag sampling can individually improve FID, and their combination can achieve even better performance. $^{**}$Results by our reproduction.
  • Figure 2: Comparison of different strategies for adding fresh noise in zigzag sampling. (a) 1-step generation. (b) Zigzag sampling with manually added fresh noise, where the new noise drastically alters the content. (c) Zigzag sampling with manually added, fixed noise, i.e., the fresh noise injected in each iteration is fixed to be the same as the initial noise; the quality deteriorates significantly. (d) Zigzag sampling with BCM. At each iteration, we apply a small amount of noise and let the network amplify it; the image content is mostly maintained.
  • Figure 3: Samples by BCM-deep on CIFAR-10.
  • Figure 4: Samples by BCM-deep on ImageNet-64.
  • ...and 25 more figures
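
The zigzag strategy described in the Figure 2 caption (apply a small amount of noise, let the network amplify it, then denoise again) can be sketched as follows. This is one plausible reading of the caption, not the authors' algorithm. It reuses a toy setting with Gaussian data and an assumed noise schedule $\sigma(t) = t$, so that a closed-form flow map can stand in for the trained network. The names `flow` and `zigzag_sample`, the constants, and the choice `t_small = 0.05` are all hypothetical.

```python
import numpy as np

SIGMA_DATA, EPS, T = 0.5, 0.002, 80.0   # assumed toy constants (illustrative)


def flow(x, t, u):
    """Closed-form PF ODE map t -> u for Gaussian data; stand-in for the network."""
    return x * np.sqrt((SIGMA_DATA**2 + u**2) / (SIGMA_DATA**2 + t**2))


def zigzag_sample(x_T, t_small, n_rounds, rng):
    """Return (sample, number of network calls) for a zigzag-style sampler."""
    x = flow(x_T, T, EPS)                 # initial one-step generation
    nfe = 1
    for _ in range(n_rounds):
        # Inject a small amount of fresh noise: EPS -> t_small (no network call).
        x = x + np.sqrt(t_small**2 - EPS**2) * rng.normal(size=x.shape)
        x = flow(x, t_small, T)           # network amplifies the noise up to T
        x = flow(x, T, EPS)               # network denoises back down to EPS
        nfe += 2
    return x, nfe


rng = np.random.default_rng(0)
x_T = rng.normal(scale=np.sqrt(SIGMA_DATA**2 + T**2), size=4)
one_step, _ = zigzag_sample(x_T, t_small=0.05, n_rounds=0, rng=rng)
refined, nfe = zigzag_sample(x_T, t_small=0.05, n_rounds=1, rng=rng)
```

In this sketch each round costs two network calls, so a single round uses three evaluations in total, which matches the NFE=3 setting that the sample-quality caption associates with zigzag sampling. Because only a small amount of noise is injected, `refined` stays close to `one_step`, in line with the observation that the image content is mostly maintained.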