UniProcessor: A Text-induced Unified Low-level Image Processor

Huiyu Duan, Xiongkuo Min, Sijing Wu, Wei Shen, Guangtao Zhai

TL;DR

UniProcessor addresses the challenge of handling diverse image degradations with a single model by introducing text-induced, degradation-aware context control. It combines an instruction-tuned low-level vision-language VQA module to generate subject prompts with a multimodal representation learning pipeline, and a context-controlled processor backbone that injects guidance via cross-attention using subject and manipulation prompts. The approach achieves state-of-the-art performance across 30 degradations, supports separate or sequential processing of multiple degradations, and enables automatic or user-driven control, demonstrating strong generalization and practical utility for real-world low-level image restoration. The work advances toward flexible, interpretable, and scalable all-in-one restoration systems that can adapt to unseen degradations and user intents with natural-language prompts.

Abstract

Image processing, including image restoration and image enhancement, involves generating a high-quality clean image from a degraded input. Deep learning-based methods have shown superior performance on various image processing tasks under single-task conditions. However, they require training separate models for different degradation types and levels, which limits the generalization ability of these models and restricts their application in the real world. In this paper, we propose a text-induced unified image processor for low-level vision tasks, termed UniProcessor, which can effectively process various degradation types and levels and supports multimodal control. Specifically, UniProcessor encodes degradation-specific information with the subject prompt and processes degradations with the manipulation prompt. These context control features are injected into the UniProcessor backbone via cross-attention to control the processing procedure. For automatic subject-prompt generation, we further build a vision-language model for general-purpose low-level degradation perception via instruction tuning. UniProcessor covers 30 degradation types, and extensive experiments demonstrate that it processes these degradations well without additional training or tuning and outperforms competing methods. Moreover, with the help of degradation-aware context control, UniProcessor is the first to show the ability to handle a single distortion individually in an image with multiple degradations.
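
The abstract states that context control features are injected into the backbone via cross-attention. The following is a minimal sketch of that general mechanism only, not the authors' implementation: image tokens act as queries, and prompt-derived context tokens act as keys and values. The function names, the single-head form, the residual connection, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    Image tokens are projected to queries; context tokens (for example,
    prompt embeddings) are projected to keys and values.
    """
    q = matmul(image_tokens, w_q)    # (n_img, d)
    k = matmul(context_tokens, w_k)  # (n_ctx, d)
    v = matmul(context_tokens, w_v)  # (n_ctx, d)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to (d, n_ctx)
    scores = matmul(q, k_t)               # (n_img, n_ctx)
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)         # (n_img, d)
    # Residual connection (an assumption): add the guidance to the image tokens.
    out = [[x + a for x, a in zip(x_row, a_row)]
           for x_row, a_row in zip(image_tokens, attended)]
    return out, weights


# Toy data: three image tokens and two context tokens (say, one standing in
# for a subject prompt and one for a manipulation prompt), all hypothetical.
identity = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_tokens = [[1.0, 0.0], [0.0, 1.0]]
out, weights = cross_attention(image_tokens, context_tokens,
                               identity, identity, identity)
```

Each row of `weights` sums to 1, so every image token receives a convex combination of the context values. A real backbone would use learned projections, multiple heads, and normalization layers.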

Paper Structure (31 sections, 2 equations, 6 figures, 9 tables)

This paper contains 31 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: UniProcessor is capable of processing various degradations in one model with text control. (a) For a single degradation, UniProcessor restores images well. (b) For an image with multiple degradations, UniProcessor can process an individual distortion with text control, which demonstrates its superior distortion perception and disentangling abilities. (c) For images with multiple degradations, UniProcessor can process each degradation step by step to restore or enhance the images.
  • Figure 2: An overview of UniProcessor and example outputs. (a) An overview of the proposed UniProcessor. (i) UniProcessor first learns a low-level vision-language model via instruction tuning, which can adapt to various degradation-aware visual questions and generate the subject prompt. (ii) The subject prompt and the extracted input image embedding are encoded to obtain the subject prompt embedding, which is then combined with the manipulation prompt to obtain the context control embedding. (iii) The guidance information is injected into the Processor backbone at multiple decoding stages. (b) Examples generated by UniProcessor, which demonstrate its good control ability and degradation-disentangling capability.
  • Figure 3: An overview of the Processor backbone. (a) The architecture of the Processor backbone. (b) The illustration of a Transformer block. (c) The illustration of a ConvFormer block. (d) The illustration of the ConvBlock. (e) The illustration of the Gated Conv Feed-Forward Network (GCFFN). (f) The demonstration of the Context Interaction Module (CIM). LN indicates a LayerNorm layer. CA is a channel-attention layer. G-MSA represents the global multi-head self-attention. GRN means the global response normalization.
  • Figure 4: Visualization results on 5 different degradation types. UniProcessor produces more visually pleasing results.
  • Figure 5: t-SNE plots of the degradation embeddings in UniProcessor (ours) and the state-of-the-art model PromptIR (Potlapalli et al., 2023). Our results are better clustered, demonstrating the effectiveness of the text-induced prompt method for learning discriminative degradation context.
  • ...and 1 more figures
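
The Figure 3 caption names global response normalization (GRN) as one of the backbone's building blocks but does not define it. The sketch below assumes the formulation introduced with ConvNeXt V2: compute a per-channel L2 norm over all spatial positions, divide by the mean of those norms across channels, and apply the result as a learnable, residual channel-wise rescaling. Whether UniProcessor uses exactly this form is an assumption; the function name, argument layout, and `eps` value are illustrative.

```python
import math


def global_response_norm(x, gamma, beta, eps=1e-6):
    """Apply GRN to a feature map (hypothetical helper).

    x is an H x W x C nested list; gamma and beta are length-C lists of
    learnable scale and bias parameters.
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    # Global feature aggregation: L2 norm of each channel over all positions.
    g = [
        math.sqrt(sum(x[i][j][c] ** 2
                      for i in range(height) for j in range(width)))
        for c in range(channels)
    ]
    # Divisive normalization: each channel's norm relative to the channel mean.
    mean_g = sum(g) / channels
    n = [g_c / (mean_g + eps) for g_c in g]
    # Calibration with a residual path: gamma * (x * n) + beta + x.
    return [
        [
            [gamma[c] * x[i][j][c] * n[c] + beta[c] + x[i][j][c]
             for c in range(channels)]
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy 1 x 1 x 2 feature map (hypothetical values).
feature_map = [[[3.0, 1.0]]]
unchanged = global_response_norm(feature_map, [0.0, 0.0], [0.0, 0.0])
rescaled = global_response_norm(feature_map, [1.0, 1.0], [0.0, 0.0])
```

With `gamma` and `beta` set to zero the layer is an identity map, so `unchanged` equals the input. With `gamma` set to one, the channel with the larger global norm is amplified more than the weaker one.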