TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting

Liangbin Xie, Daniil Pakhomov, Zhonghao Wang, Zongze Wu, Ziyan Chen, Yuqian Zhou, Haitian Zheng, Zhifei Zhang, Zhe Lin, Jiantao Zhou, Chao Dong

TL;DR

TurboFill addresses the high computational cost of diffusion-based image inpainting by training an inpainting adapter directly on a few-step distilled diffusion model (DMD2) with a 3-step adversarial scheme that combines a diffusion discriminator with GAN objectives. The approach produces high-quality inpainting with only four diffusion steps. It also introduces LocalCaptionData for targeted region prompts, along with two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, which uses human feedback on complex prompts. Empirical results show that TurboFill outperforms both multi-step BrushNet and LoRA-accelerated few-step baselines on objective quality metrics and in human preference, while reducing training and inference costs. The work offers a practical solution for fast, realistic inpainting in real-world workflows, with dedicated benchmarks to guide future improvements.

Abstract

This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: https://liangbinxie.github.io/projects/TurboFill/

Paper Structure

This paper contains 27 sections, 8 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: We propose TurboFill, a fast image inpainting method that leverages a 3-step adversarial training scheme. With only four diffusion steps, TurboFill outperforms the multi-step BrushNet* (Ju et al., 2024), delivering realistic details and textures with remarkable efficiency.
  • Figure 2: (Zoom in for best view) 1. The multi-step adapter achieves high-quality inpainting results but incurs significant inference costs, requiring over 50 diffusion steps. 2. Applying the pre-trained multi-step BrushNet adapter directly to the few-step U-Net (DMD2) results in artifacts, including oversaturated colors and semantic inconsistencies (e.g., generating a dog with two tails). 3. Training the adapter-DMD2 solely with diffusion loss produces blurred outputs with low-quality inpainting results. 4. In contrast, training the adapter-DMD2 using the proposed 3-step adversarial training scheme yields high-quality inpainting results, requiring only four diffusion steps.
  • Figure 3: The training of TurboFill alternates between 3 steps: (1) the adapter is optimized using the gradient of $\mathcal{L}^{R}_{\mathrm{Diff}}$; (2) $\mathcal{L}_{\mathcal{G}}$ and $\mathcal{L}_{\mathrm{BG}}$ are employed to update the adapter in the fast generator; and (3) $\mathcal{L}^{F}_{\mathrm{Diff}}$ and $\mathcal{L}_{\mathcal{D}}$ are jointly applied to update the parameters of the diffusion discriminator module. Note that the adapters in the slow generator and the fast generator share the same weights during training.
  • Figure 4: Comparison with previous inpainting methods and BrushNet on DilationBench. Compared to other methods, TurboFill generates more realistic details and textures in just 4 steps, while achieving good scene harmonization. (Zoom in for best view)
  • Figure 5: The effectiveness of LocalCaptionData. All results are obtained based on 4-step DMD2. (Zoom in for best view)
  • ...and 9 more figures
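
The alternating schedule described in the Figure 3 caption can be sketched as a simple round-robin over three steps. This is only an illustrative sketch of the step ordering, not the authors' training code: the names `TrainStep`, `SCHEDULE`, `training_plan`, and `count_updates`, the step labels, and the strict one-step-per-iteration cycling are all assumptions made for illustration. The loss symbols and the parameter group each step updates are taken from the caption; the losses themselves are not implemented here.

```python
from dataclasses import dataclass
from itertools import cycle, islice


@dataclass(frozen=True)
class TrainStep:
    name: str      # hypothetical label, not from the paper
    losses: tuple  # loss symbols as written in the Figure 3 caption
    updates: str   # parameter group that receives the gradient


# The three steps in the order the caption lists them.
# Steps 1 and 2 both update the adapter, because the caption states
# that the slow-generator and fast-generator adapters share weights.
SCHEDULE = (
    TrainStep("step_1", ("L_Diff^R",), "adapter"),
    TrainStep("step_2", ("L_G", "L_BG"), "adapter"),
    TrainStep("step_3", ("L_Diff^F", "L_D"), "diffusion_discriminator"),
)


def training_plan(num_iterations):
    """Return the step run at each iteration, cycling through SCHEDULE.

    Assumption: exactly one step per iteration, in strict rotation.
    """
    return list(islice(cycle(SCHEDULE), num_iterations))


def count_updates(plan):
    """Count how often each parameter group is updated in a plan."""
    counts = {}
    for step in plan:
        counts[step.updates] = counts.get(step.updates, 0) + 1
    return counts


if __name__ == "__main__":
    plan = training_plan(6)
    for i, step in enumerate(plan):
        print(i, step.name, " + ".join(step.losses), "->", step.updates)
    print(count_updates(plan))
```

Under this assumed rotation the shared adapter is updated twice for every discriminator update, which is the usual shape of an alternating generator/discriminator loop; the actual update ratio used in the paper is not stated in the caption.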