Unifying Image Processing as Visual Prompting Question Answering

Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong

TL;DR

PromptGIP addresses the fragmented landscape of low-level image processing by proposing a universal model that uses a visual prompting question answering paradigm to unify restoration, enhancement, and edge detection tasks. The method encodes task prompts as input-output image pairs and trains with a masked autoencoding objective inside a Q-A-Q-A sequence, enabling task-conditioned inference without task-specific finetuning. Empirical results show that PromptGIP, with a ViT-large backbone, handles 15 tasks, often surpassing task-specific or prior all-in-one restoration methods on several degradations and showing notable cross-domain flexibility. This work suggests a path toward general-purpose, foundation-model-like capabilities in low-level vision, bridging ideas from NLP prompting to pixel-level image processing.

Abstract

Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications. Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise. Building upon the success of large language models (LLMs) in natural language processing (NLP), there is a similar trend in computer vision, which focuses on developing large-scale models through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models, yielding a powerful unified model to deal with various tasks. However, these advances have predominantly concentrated on high-level vision tasks, with less attention paid to low-level vision tasks. To address this issue, we propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction tasks, etc. Our proposed framework, named PromptGIP, unifies these diverse image processing tasks within a universal framework. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat the input-output image pair as a structured question-answer sentence, thereby reprogramming the image processing task as a prompting QA problem. PromptGIP can undertake diverse cross-domain tasks using provided visual prompts, eliminating the need for task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. While PromptGIP has demonstrated a certain degree of out-of-domain task generalization capability, further research is expected to fully explore its more powerful emergent generalization.

Paper Structure

This paper contains 21 sections, 1 equation, 12 figures, 5 tables.

Figures (12)

  • Figure 1: PromptGIP is a universal framework for general image processing. It can accomplish diverse tasks with distinct output domains, including image restoration, enhancement and edge detection. It has demonstrated a certain level of generalization for out-of-domain tasks (marked in dashed lines).
  • Figure 2: Analogous to NLP tasks, various image processing tasks can be unified into a general visual prompting QA paradigm: given a pair of image prompts, the model processes the query image based on those prompts. MAE-VQGAN fragments image tokens and arranges them in an interleaved fashion, which disrupts the continuity and contextual understanding of the image content. Painter adopts a Q-Q-A-A organizational structure, which is not aligned with the QA paradigm. This misalignment can lead to inefficiencies in learning.
  • Figure 3: We structure the input and output images as a "Q-A-Q-A" sequence. During training, the answer images (A) are randomly masked and predicted. For inference, PromptGIP applies the appropriate processing to the question image according to the prompt pairs.
  • Figure 4: The drawbacks of existing methods. MAE-VQGAN fails to produce high-quality images. The prompts of Painter do not actually work well.
  • Figure 5: Visual results of PromptGIP on all-in-one multi-task restoration.
  • ...and 7 more figures
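
The "Q-A-Q-A" layout described in the Figure 3 caption can be illustrated with a small masking sketch. This is not the authors' implementation: the function name `build_qaqa_mask`, the per-image patch count, the mask ratio, and the choice to fully mask the final answer at inference are all assumptions made for illustration. It only shows which patch positions in the sequence would be hidden from the model.

```python
import random


def build_qaqa_mask(patches_per_image, mask_ratio, rng, inference=False):
    """Return a boolean mask over a Q-A-Q-A sequence of image patches.

    Assumed layout (hypothetical, for illustration only):
        [prompt Q | prompt A | query Q | query A]
    with `patches_per_image` patches per image. True marks a patch that is
    hidden from the model and has to be predicted.
    """
    roles = ["Q", "A", "Q", "A"]
    mask = []
    for idx, role in enumerate(roles):
        if role == "Q":
            # Question images are always fully visible.
            mask.extend([False] * patches_per_image)
        elif inference:
            # Assumption: at inference the prompt answer stays visible and
            # the final answer is fully masked, so the model must produce it.
            is_final_answer = idx == len(roles) - 1
            mask.extend([is_final_answer] * patches_per_image)
        else:
            # Training: hide a random subset of each answer image's patches.
            n_hidden = round(mask_ratio * patches_per_image)
            hidden = set(rng.sample(range(patches_per_image), n_hidden))
            mask.extend([i in hidden for i in range(patches_per_image)])
    return mask


if __name__ == "__main__":
    rng = random.Random(0)
    train_mask = build_qaqa_mask(4, 0.5, rng)
    infer_mask = build_qaqa_mask(4, 0.5, rng, inference=True)
    print("train:", train_mask)
    print("infer:", infer_mask)
```

Under these assumptions, a reconstruction loss would be computed only at positions where the mask is True, so the question images and the visible prompt answer act purely as context.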