Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

Yuchuan Tian, Jianhong Han, Hanting Chen, Yuanyuan Xi, Ning Ding, Jie Hu, Chao Xu, Yunhe Wang

TL;DR

The paper tackles the challenge of jointly handling diverse image restoration tasks with a single IPT backbone, addressing the limitations of prior All-in-One models. It introduces Instruct-IPT, which conducts weight-based adaptation through task-specific, low-rank biases and employs synchronous training to learn general and task-specific knowledge, optionally guided by human language. Key contributions include empirical evidence that feature adaptation struggles with dissimilar tasks, a parameter-efficient weight modulation strategy with constant-rank biases, and state-of-the-art performance across five restoration tasks, plus extension to diffusion models. The work offers a practical, versatile framework for multi-task image restoration that can be guided by prompts and applied to diffusion workflows, potentially broadening real-world applicability.

Abstract

Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have become popular. However, existing All-in-One models are limited in both the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT, an All-in-One Image Processing Transformer (IPT) that can effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. While most prior work proposes feature adaptation methods, we reveal their failure in addressing highly distinct tasks and suggest weight modulation, which adapts weights to specific tasks. First, we search for task-sensitive weights and introduce task-specific biases on top of them. Second, we conduct rank analysis to find a good compression strategy and perform low-rank decomposition on the biases. Third, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn both general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be a task expert, Instruct-IPT can better coordinate tasks with distinct characteristics at modest cost. As an additional feature, we enable Instruct-IPT to receive human prompts. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have also extended our method to diffusion denoisers. The code is available at https://github.com/huawei-noah/Pretrained-IPT.
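
The weight modulation and synchronous training described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `ModulatedLinear`, the method `sync_step`, the zero initialization of one factor, the squared-error loss, and plain gradient descent are all assumptions made for illustration. The only ideas taken from the abstract are that a shared weight receives a task-specific low-rank bias and that the shared and task-specific parameters are updated in the same step.

```python
import numpy as np


class ModulatedLinear:
    """Toy linear layer: shared weight W plus a per-task low-rank bias U_t @ V_t.

    Hypothetical illustration only; names and initialization are assumptions.
    """

    def __init__(self, d_in, d_out, tasks, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        # Task-general weight shared by every task.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Task-specific low-rank factors. U starts at zero (an assumption),
        # so every task initially uses the shared weight unchanged.
        self.U = {t: np.zeros((d_out, rank)) for t in tasks}
        self.V = {t: rng.standard_normal((rank, d_in)) / np.sqrt(d_in)
                  for t in tasks}

    def weight(self, task):
        # Effective weight for one task: shared part plus low-rank bias.
        return self.W + self.U[task] @ self.V[task]

    def forward(self, x, task):
        return x @ self.weight(task).T

    def loss(self, x, y, task):
        # Half mean squared error over the batch.
        err = self.forward(x, task) - y
        return 0.5 * np.mean(np.sum(err ** 2, axis=1))

    def sync_step(self, x, y, task, lr=0.02):
        # Gradient of the loss with respect to the effective weight.
        err = self.forward(x, task) - y
        G = err.T @ x / len(x)
        # Chain rule through W + U @ V gives the factor gradients.
        grad_U = G @ self.V[task].T
        grad_V = self.U[task].T @ G
        # Synchronous update: backbone and the task's bias move together.
        self.W -= lr * G
        self.U[task] -= lr * grad_U
        self.V[task] -= lr * grad_V


layer = ModulatedLinear(8, 8, tasks=["denoise", "derain"], rank=2)
rng = np.random.default_rng(1)
x = rng.standard_normal((32, 8))
y = rng.standard_normal((32, 8))

loss_before = layer.loss(x, y, "derain")
for _ in range(100):
    layer.sync_step(x, y, "derain")
loss_after = layer.loss(x, y, "derain")

# Extra parameters stored per task: rank * (d_in + d_out) instead of d_in * d_out.
extra_params = layer.U["derain"].size + layer.V["derain"].size
```

In this toy setup each task adds 32 parameters to a layer whose full weight has 64, and the saving grows with layer width because the bias cost scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.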

Paper Structure (12 sections, 3 equations, 5 figures, 11 tables)

This paper contains 12 sections, 3 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Framework of Instruct-IPT. Thanks to the proposed weight modulation method, Instruct-IPT performs well on a wide range of tasks. Weight modulation involves adding task-specific biases (which are low-rank decomposed) to a general backbone. Synchronous training updates the backbone and the biases simultaneously, so that task-specific knowledge is extracted automatically. Text instructions can be provided to command the model.
  • Figure 2: PCA cumulative energy under different rank strategies across layers. The shade of the background color indicates the depth of the U-Net stage. The constant rank strategy is better than the proportional strategy at covering the overall information of the biases.
  • Figure 3: A demo of Instruct-IPT instructed by human language. Our method could achieve good image restoration results on various tasks while responding to human language.
  • Figure 4: Qualitative comparisons between Instruct-IPT and competitive baselines. We compare the two methods on three tasks: denoising ($\sigma=50$), deraining, and denoising. Instruct-IPT outperforms the baselines by large margins in terms of visual quality.
  • Figure 5: Qualitative comparisons of several methods for diffusion models. We compare our method with three other baselines on generative tasks: inpainting (the first two rows) and outpainting (the last two rows). Our method generates images with greater logical consistency and realism.
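
The cumulative energy plotted in Figure 2 can be approximated with a short sketch. It assumes that "energy" means the sum of squared singular values of a bias matrix, the usual PCA convention; the paper's exact definition may differ. The function names, the layer shapes, and the rank settings below are hypothetical and chosen only to show how a constant rank and a proportional rank would be compared.

```python
import numpy as np


def cumulative_energy(matrix, rank):
    """Fraction of squared singular-value energy kept by the top `rank` components."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    energy = singular_values ** 2
    return float(energy[:rank].sum() / energy.sum())


def proportional_rank(shape, ratio):
    """Rank that scales with the smaller matrix dimension (at least 1)."""
    return max(1, int(ratio * min(shape)))


# A 4x4 identity spreads its energy evenly, so one component keeps a quarter.
identity_energy = cumulative_energy(np.eye(4), rank=1)

# A matrix built from two rank-1 terms is fully captured at rank 2.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((16, 2)) @ rng.standard_normal((2, 12))
low_rank_energy = cumulative_energy(low_rank, rank=2)

# Hypothetical layer shapes: compare a constant rank against a proportional one.
for shape in [(16, 16), (64, 64)]:
    bias = rng.standard_normal(shape)
    constant = cumulative_energy(bias, rank=8)
    proportional = cumulative_energy(bias, proportional_rank(shape, ratio=0.125))
    print(shape, round(constant, 3), round(proportional, 3))
```

Random matrices like these carry no real structure, so the printed numbers only demonstrate the computation, not the paper's finding about which strategy covers more information.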