Accelerating Image Generation with Sub-path Linear Approximation Model

Chen Xu, Tianhui Song, Weixin Feng, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

TL;DR

SPLAM addresses the slow inference of diffusion models by modeling PF-ODE trajectories as sub-paths and learning through Sub-Path Linear (SL) ODEs to provide progressive, continuous error estimates. It decomposes the denoising objective into components that SL-ODEs can optimize with smaller cumulative errors and introduces SPLAD to distill these ideas into latent diffusion models, enabling efficient training. Empirical results on LAION and COCO show SPLAM achieving high-quality generation with 2–4 steps, outperforming existing acceleration methods in both FID and image quality while requiring only about 6 A100 GPU days. The approach combines a principled sub-path interpolation, gamma-conditioned training, and selective distillation to deliver practical, fast diffusion-based synthesis with strong generalization across backbones and datasets.

Abstract

Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks. However, their applications in practical scenarios are hindered by slow inference speed. Drawing inspiration from the approximation strategies utilized in consistency models, we propose the Sub-path Linear Approximation Model (SPLAM), which accelerates diffusion models while maintaining high-quality image generation. SPLAM treats the PF-ODE trajectory as a series of PF-ODE sub-paths divided by sampled points, and harnesses sub-path linear (SL) ODEs to form a progressive and continuous error estimation along each individual PF-ODE sub-path. The optimization on such SL-ODEs allows SPLAM to construct denoising mappings with smaller cumulative approximated errors. An efficient distillation method is also developed to facilitate the incorporation of more advanced diffusion models, such as latent diffusion models. Our extensive experimental results demonstrate that SPLAM achieves an efficient training regimen, requiring only 6 A100 GPU days to produce a high-quality generative model capable of 2- to 4-step generation with high performance. Comprehensive evaluations on the LAION, MS COCO 2014, and MS COCO 2017 datasets also illustrate that SPLAM surpasses existing acceleration methods in few-step generation tasks, achieving state-of-the-art performance on both FID and the quality of the generated images.
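
As a rough illustration of the "sub-path linear" idea in the abstract, the sketch below linearly interpolates between the two endpoints of one sub-path. The function name `sl_interpolate`, the argument names, and the convention that `gamma = 0` gives the start point and `gamma = 1` the end point are all assumptions made for this example; the paper's exact parameterization of SL-ODEs may differ.

```python
from typing import List, Sequence


def sl_interpolate(
    x_start: Sequence[float], x_end: Sequence[float], gamma: float
) -> List[float]:
    """Return the point at fraction `gamma` along the straight line
    from `x_start` to `x_end` (hypothetical helper, assumed convention)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(x_start) != len(x_end):
        raise ValueError("endpoints must have the same length")
    # Element-wise convex combination of the two endpoints.
    return [(1.0 - gamma) * a + gamma * b for a, b in zip(x_start, x_end)]


# Toy endpoints standing in for two sampled points on a PF-ODE trajectory.
x_n = [0.0, 2.0]
x_n_plus_1 = [4.0, 6.0]
midpoint = sl_interpolate(x_n, x_n_plus_1, 0.5)
```

Sweeping `gamma` from 0 to 1 traces a continuous straight path between the endpoints, which is the kind of path along which a progressive error estimate could be formed.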

Paper Structure

This paper contains 21 sections, 28 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Our Sub-Path Linear Approximation Model employs Sub-Path Linear ODEs to approximate the sub-paths on the PF-ODE trajectories; each SL-ODE is determined by the linear interpolation of the corresponding sub-path endpoints. SPLAM is then trained on the consistent mapping along SL-ODEs to minimize the approximated errors.
  • Figure 2: (a) Ablations on skipping step size and skipping mechanism. ME denotes our Multiple Estimation strategy. (b) Training curves comparing LCM and SPLAM. Our SPLAM with step size 100 is trained with ME, which brings faster convergence. (c) Estimation of the error $\delta$ between the consistency mapping values of two adjacent points along the PF-ODE. SPLAM consistently achieves a lower error than LCM.
  • Figure 3: (a) Visualization of different guidance scales $w$ on SPLAM. (b) The trade-off curve of applying different guidance scales. $w$ increases over $\{3.0, 5.0, 8.0, 12.0\}$.
  • Figure 4: Comparison of our SPLAM and LCM in 1-, 2-, and 4-step generation. The results of LCM are based on our reproduction, as described in \ref{sec:exp1}. SPLAM consistently generates higher-quality images that are clearer and more detailed. Notably, SPLAM's 2-step results closely match the 4-step results of LCM, highlighting the efficiency and effectiveness of our approach in producing high-fidelity images with fewer generation steps.
  • Figure 5: Qualitative results. The text prompts are selected from DMD [dmd] in (a) and UFOGEN [xu2023ufogen] in (b), and the results of those two methods are taken from their respective papers. SPLAM demonstrates the best quality in 4-step generation among all methods except the SD models. When the number of sampling steps is reduced to 2, SPLAM still maintains comparable performance and generates even better results than 4-step LCM.
  • ...and 6 more figures
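
The guidance scale $w$ in the Figure 3 caption can be illustrated with a minimal sketch of classifier-free guidance. This uses one common convention, `eps_uncond + w * (eps_cond - eps_uncond)`; the captions do not say which convention SPLAM adopts, so the formula, the helper name `apply_guidance`, and the toy inputs are assumptions for illustration only.

```python
from typing import List, Sequence


def apply_guidance(
    eps_uncond: Sequence[float], eps_cond: Sequence[float], w: float
) -> List[float]:
    """Combine unconditional and conditional noise predictions
    (hypothetical helper; assumes the eps_uncond + w * (eps_cond - eps_uncond) form)."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # w = 1 recovers the conditional prediction; larger w pushes further
    # along the direction from the unconditional to the conditional one.
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions, swept over the scales listed in the Figure 3 caption.
eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
guided = {w: apply_guidance(eps_u, eps_c, w) for w in (3.0, 5.0, 8.0, 12.0)}
```

Under this convention, a larger `w` moves the guided prediction further from the unconditional one, which is the knob varied along the trade-off curve in Figure 3(b).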