Unifying Image Processing as Visual Prompting Question Answering

Yihao Liu; Xiangyu Chen; Xianzheng Ma; Xintao Wang; Jiantao Zhou; Yu Qiao; Chao Dong

TL;DR

PromptGIP addresses the fragmented landscape of low-level image processing by proposing a universal model that uses a visual prompting question answering paradigm to unify restoration, enhancement, and edge detection tasks. The method encodes each task prompt as an input-output image pair and trains with a masked autoencoding objective over a Q-A-Q-A sequence, so that inference is conditioned on the prompt and needs no task-specific finetuning. In experiments, PromptGIP handles 15 tasks with a ViT-large backbone, often surpassing task-specific or prior all-in-one restoration methods on several degradations while remaining flexible across output domains. The work suggests a path toward general-purpose, foundation-model-like capabilities in low-level vision, carrying ideas from NLP prompting over to pixel-level image processing.

Abstract

Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications. Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise. Building upon the success of large language models (LLMs) in natural language processing (NLP), there is a similar trend in computer vision, which focuses on developing large-scale models through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models, yielding a powerful unified model to deal with various tasks. However, these advances have predominantly concentrated on high-level vision tasks, with less attention paid to low-level vision tasks. To address this issue, we propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction tasks, etc. Our proposed framework, named PromptGIP, unifies these diverse image processing tasks within a universal framework. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat the input-output image pair as a structured question-answer sentence, thereby reprogramming the image processing task as a prompting QA problem. PromptGIP can undertake diverse cross-domain tasks using provided visual prompts, eliminating the need for task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. While PromptGIP has demonstrated a certain degree of out-of-domain task generalization capability, further research is expected to fully explore its more powerful emergent generalization.
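The Q-A-Q-A arrangement with masked answers can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the patch count, and the `mask_ratio` value are hypothetical, and the choice to hide the entire second answer at inference is an assumption made for illustration. The sketch only tracks which patch slots are hidden from the model; it does not touch pixels or a network.

```python
import random

ROLES = ("Q1", "A1", "Q2", "A2")  # prompt question, prompt answer, query question, query answer


def build_qaqa_roles(patches_per_image):
    """Label every patch slot of a Q-A-Q-A sequence with the image it belongs to."""
    roles = []
    for role in ROLES:
        roles.extend([role] * patches_per_image)
    return roles


def sample_mask(roles, mask_ratio, rng, inference=False):
    """Return a list of booleans, True where a patch is hidden from the model.

    Training: a random subset of the answer patches (A1 and A2) is hidden.
    Inference (assumed here): the prompt pair and the query question stay
    visible, and every patch of the query answer A2 must be predicted.
    """
    if inference:
        return [role == "A2" for role in roles]
    answer_slots = [i for i, role in enumerate(roles) if role.startswith("A")]
    n_hidden = round(mask_ratio * len(answer_slots))
    hidden = set(rng.sample(answer_slots, n_hidden))
    return [i in hidden for i in range(len(roles))]


if __name__ == "__main__":
    rng = random.Random(0)
    roles = build_qaqa_roles(patches_per_image=4)       # hypothetical patch count
    train_mask = sample_mask(roles, mask_ratio=0.5, rng=rng)  # hypothetical ratio
    infer_mask = sample_mask(roles, mask_ratio=0.5, rng=rng, inference=True)
    print(list(zip(roles, train_mask)))
    print(list(zip(roles, infer_mask)))
```

Question patches are never hidden in this sketch, which mirrors the idea that the model reads the questions and one worked answer, then fills in the missing answer content.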

Paper Structure (21 sections, 1 equation, 12 figures, 5 tables)

Introduction
Related Work
Method
Image Processing as Visual Question Answering
Masked Visual Prompting Paradigm
Further Discussion
Experiments and Analysis
Image Processing Task Settings
Implementation Details
Experiments
Conclusion
Broader Impact
Details of Image Processing Tasks
Image Restoration Tasks
Image Enhancement Tasks
...and 6 more sections

Figures (12)

Figure 1: PromptGIP is a universal framework for general image processing. It can accomplish diverse tasks with distinct output domains, including image restoration, enhancement and edge detection. It has demonstrated a certain level of generalization for out-of-domain tasks (marked in dashed lines).
Figure 2: Analogous to NLP tasks, various image processing tasks can be unified into a general visual prompting QA paradigm: given a pair of image prompts, the model processes the query image based on the prompts. MAE-VQGAN fragments image tokens and arranges them in an interleaved fashion, which disrupts the continuity and contextual understanding of the image content. Painter adopts a Q-Q-A-A organizational structure, which is not aligned with the QA paradigm. This misalignment can lead to inefficiencies in learning.
Figure 3: We structure the input and output images as a "Q-A-Q-A" sequence. During training, the answer images (A) are randomly masked and predicted. For inference, PromptGIP applies the appropriate processing to the question image according to the prompt pairs.
Figure 4: The drawbacks of existing methods. MAE-VQGAN fails to produce high-quality images. The prompts of Painter do not actually work well.
Figure 5: Visual results of PromptGIP on all-in-one multi-task restoration.
...and 7 more figures
