FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan

TL;DR

This work confronts the limited real-world generalization of universal image restoration with two contributions: a million-scale real-world paired dataset gathered with a robotic shooting system, and FoundIR, a robust two-tier model. FoundIR combines a diffusion-based degradation-agnostic generalist with degradation-aware specialists in an ensemble, supported by an incremental learning strategy to scale with data while mitigating forgetting. Empirical results across 24 benchmarks and public datasets show state-of-the-art performance and strong generalization, highlighting the dataset’s value and the method’s effectiveness in handling diverse real-world degradations. The approach has practical implications for building more reliable foundation-like models in image restoration and sets a new direction for dataset scale and training strategies in this domain.

Abstract

Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: larger-scale real-world samples and more diverse degradation types. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.
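
The two-stage flow described in the abstract (a generalist pass followed by specialist refinement) can be illustrated with a minimal sketch. It is not the authors' implementation: the `restore` function, the `scenario` routing key, and the toy models are all hypothetical, and the abstract does not say how a specialist is selected, so a simple dictionary lookup is assumed here.

```python
from typing import Callable, Dict, List

# Stand-in for an image; real code would use an array or tensor type.
Image = List[float]
Model = Callable[[Image], Image]


def restore(lq: Image, generalist: Model,
            specialists: Dict[str, Model], scenario: str) -> Image:
    """Run the generalist first, then refine with a matching specialist.

    Routing by a `scenario` string is an assumption made for this sketch.
    If no specialist matches, the generalist output is returned unchanged.
    """
    coarse = generalist(lq)
    specialist = specialists.get(scenario)
    return specialist(coarse) if specialist is not None else coarse


# Toy stand-ins (hypothetical): the "generalist" clamps values to [0, 1],
# and the "weather" specialist rounds them to one decimal place.
def toy_generalist(img: Image) -> Image:
    return [min(1.0, max(0.0, p)) for p in img]


def toy_weather_expert(img: Image) -> Image:
    return [round(p, 1) for p in img]


toy_specialists: Dict[str, Model] = {"weather": toy_weather_expert}
```

Keeping the specialists in a plain mapping means a new expert can be registered without touching the generalist, which mirrors the separation between the two tiers described above.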

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: The potential of large-scale training data for universal image restoration. (a) Analysis of universal image restoration performance in real-world scenarios as training data vary. As the size of real-world training data increases, the image restoration model can achieve significant performance improvement. (b) Our proposed FoundIR, trained on our million-scale dataset, achieves state-of-the-art performance across a broad range of image restoration tasks compared to existing universal image restoration methods.
  • Figure 2: Illustration of our mechatronic shooting system used for capturing paired data. We move the camera on the electric slide rail from starting point $\mathbf{X}$ to ending point $\mathbf{Y}$. In the first round, we capture the GT data with the camera set to a fixed exposure time. Then, we capture the LQ data in the second round by adjusting the camera settings (e.g., ISO, aperture, and focus mode), and we capture the LQ data in the third round by changing the external imaging environment (e.g., rain or blocking the light source). The movement of the camera consists of (I) a static phase, (II) an accelerating phase, (III) a uniform moving phase, and (IV) a decelerating phase. Notably, we set up reference objects (i.e., markers) to assist with data alignment to obtain paired images from the uniform moving phase.
  • Figure 3: Illustration of the proposed million-scale dataset. (a) Our dataset outperforms existing universal image restoration datasets in terms of training data scale (see y-axis) and the diversity of degradation types (indicated by numbered circles). (b) Distribution of image degradation types in the proposed dataset, including 7 isolated degradation types and 13 coupled degradation types. (c) Example images from our dataset. We adjust internal camera settings (Round II) and external imaging conditions (Round III) to capture various degradations.
  • Figure 4: Illustration of the proposed FoundIR. We first employ a diffusion-based generalist model for degradation removal, followed by multiple specialist models for quality refinement. We guide the generalist model to learn a degradation-agnostic common representation space from various degraded inputs, where incremental learning is introduced to improve the model’s training stability. For the specialist models, we construct an expert pool to handle various scenarios, comprising text repair experts, weather experts, and illumination experts.
  • Figure 5: Visual comparisons on the isolated and coupled degradation inputs from the proposed benchmark. Zoom in for a better view.
  • ...and 1 more figure
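
The Figure 2 caption states that paired images come from the uniform moving phase of the camera's travel. The sketch below shows one way frames could be filtered by phase using timestamps alone. It is an illustration under assumptions: the phase durations, the function names, and the time-based selection are all hypothetical, and the paper's actual alignment relies on reference markers and its own alignment criterion, which are not reproduced here.

```python
from typing import List, Optional


def phase_at(t: float, t_static: float, t_accel: float,
             t_uniform: float, t_decel: float) -> Optional[str]:
    """Return the motion phase label ("I" to "IV") at time t.

    Durations are given in seconds and are assumed values; the caption
    does not specify them. Returns None outside the run.
    """
    phases = [("I", t_static), ("II", t_accel),
              ("III", t_uniform), ("IV", t_decel)]
    start = 0.0
    for label, duration in phases:
        if start <= t < start + duration:
            return label
        start += duration
    return None


def uniform_phase_frames(timestamps: List[float], t_static: float,
                         t_accel: float, t_uniform: float,
                         t_decel: float) -> List[int]:
    """Indices of frames whose timestamp falls in phase III (uniform motion)."""
    return [i for i, t in enumerate(timestamps)
            if phase_at(t, t_static, t_accel, t_uniform, t_decel) == "III"]
```

Because the camera speed is constant in phase III, frames taken at the same elapsed time in different rounds should correspond to roughly the same rail position, which is presumably why that phase is the one used for pairing.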