Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration
Junyuan Deng, Xinyi Wu, Yongxing Yang, Congchao Zhu, Song Wang, Zhenyao Wu
TL;DR
This work lowers the data and compute barriers to real-world image restoration (IR) with two components: FluxGen, a pipeline that distills high-quality IR training data from a pre-trained Flux text-to-image (T2I) diffusion model, and FluxIR, a lightweight multi-modality adapter that controls Flux across all MM-DiT blocks using squeeze-and-excitation (SE) layers and learnable text embeddings. By combining efficient data generation with an optimized training strategy (timestep sampling and a pixel-space loss) and broadcasting control signals through SE modules, the approach achieves superior restoration quality on synthetic and real degradations at only about 8.5% of the training cost of current approaches. The key contributions are a data-distillation workflow that requires no external data, a 0.4B-parameter ControlNet-like FluxIR adapter, and extensive ablations confirming the importance of per-block control and multi-modality signals. The results indicate strong practical potential for privacy-friendly, scalable use of large T2I priors in image restoration, with significant cost savings and improved texture detail.
Abstract
Recently, pre-trained text-to-image (T2I) models have been extensively adopted for real-world image restoration because of their powerful generative prior. However, controlling these large models for image restoration usually requires a large number of high-quality images and immense computational resources for training, which is costly and not privacy-friendly. In this paper, we find that a well-trained large T2I model (i.e., Flux) is able to produce a variety of high-quality images aligned with real-world distributions, offering an unlimited supply of training samples to mitigate the above issue. Specifically, we propose a training data construction pipeline for image restoration, namely FluxGen, which includes unconditional image generation, image selection, and degraded image simulation. A novel lightweight adapter (FluxIR) with squeeze-and-excitation layers is also carefully designed to control the large Diffusion Transformer (DiT)-based T2I model so that reasonable details can be restored. Experiments demonstrate that our proposed method enables the Flux model to adapt effectively to real-world image restoration tasks, achieving superior scores and visual quality on both synthetic and real-world degradation datasets, at only about 8.5% of the training cost of current approaches.
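For readers unfamiliar with squeeze-and-excitation, the following is a minimal sketch of the generic SE gating mechanism applied to a sequence of token features. It is not the FluxIR implementation: the function name `squeeze_excite`, the helper `random_matrix`, the bias-free two-layer bottleneck, and the toy sizes are all assumptions made for illustration, and how FluxIR actually wires SE layers into the MM-DiT blocks is not specified in this abstract.

```python
import math
import random


def squeeze_excite(features, w1, w2):
    """Generic SE gate over a (tokens x channels) feature map.

    features: list of token vectors, each a list of C floats.
    w1: C x R reduction weights; w2: R x C expansion weights (R < C).
    Hypothetical helper for illustration, not the FluxIR code.
    """
    num_tokens = len(features)
    channels = len(features[0])
    reduced = len(w1[0])

    # Squeeze: average each channel over all tokens.
    squeezed = [sum(tok[c] for tok in features) / num_tokens
                for c in range(channels)]

    # Excitation: bottleneck MLP (linear -> ReLU -> linear -> sigmoid).
    hidden = [max(0.0, sum(squeezed[c] * w1[c][r] for c in range(channels)))
              for r in range(reduced)]
    gates = [1.0 / (1.0 + math.exp(-sum(hidden[r] * w2[r][c]
                                        for r in range(reduced))))
             for c in range(channels)]

    # Scale: reweight every token channel-wise with the same gates.
    scaled = [[tok[c] * gates[c] for c in range(channels)]
              for tok in features]
    return scaled, gates


def random_matrix(rows, cols, rng):
    """Small random weight matrix standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


# Toy usage: 4 tokens, 8 channels, bottleneck width 2 (all assumed sizes).
rng = random.Random(0)
tokens, channels, reduced = 4, 8, 2
features = [[rng.uniform(-1.0, 1.0) for _ in range(channels)]
            for _ in range(tokens)]
w1 = random_matrix(channels, reduced, rng)
w2 = random_matrix(reduced, channels, rng)
scaled, gates = squeeze_excite(features, w1, w2)
```

The gate produces one value in (0, 1) per channel and applies it to every token. This is one plausible reading of how a small module can broadcast a global control signal cheaply, since the bottleneck adds only about 2·C·R weights per layer.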
