Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration

Junyuan Deng, Xinyi Wu, Yongxing Yang, Congchao Zhu, Song Wang, Zhenyao Wu

TL;DR

This work tackles real-world image restoration (IR) by lowering the data and compute barriers with two components: FluxGen, a pipeline that distills high-quality IR training data from a pre-trained Flux T2I diffusion model, and FluxIR, a lightweight, multi-modality adapter that controls Flux across all MM-DiT blocks using squeeze-and-excitation (SE) layers and learnable text embeddings. By coupling efficient data generation with an optimized training strategy (timestep sampling and a pixel-space loss) and broadcasting control signals via SE modules, the approach achieves superior restoration quality on synthetic and real degradations at only about 8.5% of the training cost of current approaches. The key contributions are a data-distillation workflow that needs no external data, a 0.4B-parameter, ControlNet-like FluxIR adapter, and extensive ablations confirming the importance of per-block control and multi-modality signals. The results indicate strong practical potential for privacy-friendly, scalable use of large T2I priors in image restoration, with significant cost savings and improved texture detail.

Abstract

Recently, pre-trained text-to-image (T2I) models have been extensively adopted for real-world image restoration because of their powerful generative prior. However, controlling these large models for image restoration usually requires a large number of high-quality images and immense computational resources for training, which is costly and not privacy-friendly. In this paper, we find that a well-trained large T2I model (i.e., Flux) can produce a variety of high-quality images aligned with real-world distributions, offering an unlimited supply of training samples that mitigates this issue. Specifically, we propose a training data construction pipeline for image restoration, namely FluxGen, which comprises unconditional image generation, image selection, and degraded image simulation. A novel lightweight adapter (FluxIR) with squeeze-and-excitation layers is also carefully designed to control the large Diffusion Transformer (DiT)-based T2I model so that reasonable details can be restored. Experiments demonstrate that the proposed method enables the Flux model to adapt effectively to real-world image restoration tasks, achieving superior scores and visual quality on both synthetic and real-world degradation datasets at only about 8.5% of the training cost of current approaches.
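The three FluxGen stages named in the abstract (generation, selection, degradation) can be sketched as a toy data flow. This is only an illustrative sketch, not the authors' implementation: `generate_image`, `quality_score`, `degrade`, and `build_paired_dataset` are hypothetical stand-ins, the "generator" is a random-number source rather than Flux, the quality score is plain pixel variance rather than a learned IQA model, and the degradation is a simple average-pool downsample plus Gaussian noise.

```python
import random


def generate_image(rng, size=8):
    """Hypothetical stand-in for unconditional T2I sampling.

    Returns a size x size grayscale image with values in [0, 1].
    """
    contrast = rng.random()  # assumption: samples differ in how much detail they carry
    return [[0.5 + contrast * (rng.random() - 0.5) for _ in range(size)]
            for _ in range(size)]


def quality_score(img):
    """Placeholder quality score: pixel variance (a real pipeline would use an IQA model)."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def degrade(img, rng, scale=2, noise_std=0.05):
    """Toy degradation: average-pool downsampling by `scale`, then additive Gaussian noise."""
    n = len(img) // scale
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            block = [img[i * scale + a][j * scale + b]
                     for a in range(scale) for b in range(scale)]
            value = sum(block) / len(block) + rng.gauss(0.0, noise_std)
            row.append(min(1.0, max(0.0, value)))  # clamp back to [0, 1]
        out.append(row)
    return out


def build_paired_dataset(num_candidates=20, keep=5, seed=0):
    """Generate candidates, keep the top-scoring ones, and pair each with a degraded copy."""
    rng = random.Random(seed)
    candidates = [generate_image(rng) for _ in range(num_candidates)]
    selected = sorted(candidates, key=quality_score, reverse=True)[:keep]
    return [(degrade(hq, rng), hq) for hq in selected]


pairs = build_paired_dataset()  # list of (low_quality, high_quality) image pairs
```

Each element of `pairs` is a (low-quality, high-quality) tuple, which is the shape of supervision a restoration adapter would train on; swapping any stand-in for a real generator, IQA model, or degradation model leaves the overall flow unchanged.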

Paper Structure

This paper contains 16 sections, 8 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Comparison of SUPIR [yu2024scaling], DreamClear [ai2024dreamclear], and our proposed method. Our training dataset is constructed entirely from synthetic images. Trained with such data, our method achieves the most realistic restoration results with the lowest training cost.
  • Figure 2: An overview of our FluxGen pipeline. First, an empty prompt and random Gaussian noise $z_1$ are input into Flux, generating an image latent $z_0$ over $T$ steps. A VAE decoder then maps $z_0$ to its corresponding image $x_0$. High-quality images are curated by IQA-based selection, followed by image degradation to construct the final paired dataset.
  • Figure 3: Training and inference pipeline of the proposed FluxIR. FluxIR employs a single MM-DiT block with learnable T5 embedding $\theta_p$ and CLIP embedding $\theta_y$ to extract image feature $f_z$ and text feature $f_p$ from the low-quality control latent $z_{lq}$. The squeeze-and-excitation (SE) layers ($\text{SE}_z(\cdot)$ for image and $\text{SE}_p(\cdot)$ for text) broadcast these features to all Flux MM-DiT blocks to enable precise and multi-modality control.
  • Figure 4: Qualitative comparison on the synthetic dataset DIV2K-Val.
  • Figure 5: Qualitative comparison on the real-world dataset RealLQ250.
  • ...and 11 more figures
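The Figure 3 caption describes SE layers that broadcast one shared control feature to every Flux MM-DiT block. A minimal sketch of the standard squeeze-and-excitation gate (squeeze by averaging, a bottleneck with ReLU, a sigmoid, then channel-wise rescaling) applied once per block is given below. It rests on assumptions not stated in the captions: the function names `se_gate` and `random_matrix` are hypothetical, the weights are random rather than learned, one independent gate per block is assumed, and how the gated features are injected into each block is left out.

```python
import math
import random


def se_gate(tokens, w1, w2):
    """Apply a squeeze-and-excitation gate to a list of N token vectors of dimension C."""
    n, c = len(tokens), len(tokens[0])
    # Squeeze: average each channel over all tokens.
    squeezed = [sum(t[k] for t in tokens) / n for k in range(c)]
    # Excitation: bottleneck projection with ReLU ...
    hidden = [max(0.0, sum(w1[h][k] * squeezed[k] for k in range(c)))
              for h in range(len(w1))]
    # ... then expansion with a sigmoid, giving one gate in (0, 1) per channel.
    gates = [1.0 / (1.0 + math.exp(-sum(w2[k][h] * hidden[h]
                                        for h in range(len(hidden)))))
             for k in range(c)]
    # Scale: multiply every token channel-wise by its gate.
    return [[t[k] * gates[k] for k in range(c)] for t in tokens]


def random_matrix(rng, rows, cols):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, R, NUM_BLOCKS, NUM_TOKENS = 4, 2, 3, 5  # toy sizes, chosen only for illustration

# Shared control feature (playing the role of f_z), extracted once.
control = [[rng.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(NUM_TOKENS)]

# Assumption: one SE gate per backbone block, each with its own weights.
block_weights = [(random_matrix(rng, R, C), random_matrix(rng, C, R))
                 for _ in range(NUM_BLOCKS)]

# Broadcast: every block receives its own re-weighted copy of the same feature.
per_block_control = [se_gate(control, w1, w2) for w1, w2 in block_weights]
```

The point of the design, as far as the captions suggest, is that the expensive feature extraction runs once while each block only pays for a small gate; the text branch ($\text{SE}_p$) would follow the same pattern on $f_p$.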