Learning A Low-Level Vision Generalist via Visual Task Prompt

Xiangyu Chen; Yihao Liu; Yuandong Pu; Wenlong Zhang; Jiantao Zhou; Yu Qiao; Chao Dong

TL;DR

This work addresses the lack of a unified model for diverse low-level vision tasks by introducing the Visual task Prompt-based Image Processing (VPIP) framework and a low-level vision generalist model, GenLV, built on it. VPIP uses visual task prompts to represent tasks with different input-target domains and a prompt cross-attention mechanism to fuse task information with image features. Because it does not rely on the Masked Autoencoder (MAE) paradigm, the backbone is not tied to the ViT architecture and can be chosen freely. Trained on 30 tasks spanning restoration, enhancement, edge detection, and stylization, GenLV outperforms task-specific and multi-task baselines in reconstruction quality and visual fidelity. The approach offers a scalable route toward generalist low-level vision, with future work aimed at scaling data and model size to cover more out-of-distribution tasks.

Abstract

Building a unified model for general low-level vision tasks holds significant research and practical value. Current methods encounter several critical issues. Multi-task restoration approaches can address multiple degradation-to-clean restoration tasks, but their applicability to tasks with different target domains (e.g., image stylization) is limited. Methods like PromptGIP can handle multiple input-target domains but rely on the Masked Autoencoder (MAE) paradigm. Consequently, they are tied to the ViT architecture, resulting in suboptimal image reconstruction quality. In addition, these methods are sensitive to prompt image content and often struggle with low-frequency information processing. In this paper, we propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges. VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of a backbone network suitable for general tasks. In addition, a new prompt cross-attention mechanism is introduced to facilitate interaction between the input and prompt information. Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks. Experimental results show that GenLV can successfully address a variety of low-level tasks, significantly outperforming existing methods both quantitatively and qualitatively. Code is available at https://github.com/chxy95/GenLV.
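
The abstract names a prompt cross-attention that lets input features interact with prompt information but does not spell out its design. As a rough illustration of the general idea only, the sketch below uses plain scaled dot-product cross-attention in NumPy: queries come from image tokens, keys and values come from prompt tokens. The function name, token shapes, projection matrices, and the residual connection are all assumptions made for this example and are not taken from the GenLV code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def prompt_cross_attention(image_tokens, prompt_tokens, w_q, w_k, w_v):
    """Generic cross-attention sketch (hypothetical, not the GenLV layer).

    image_tokens:  (N, C) features of the image being processed
    prompt_tokens: (M, C) features encoding the visual task prompt
    w_q, w_k, w_v: (C, C) projection matrices (assumed square so the
                   residual addition below is shape-compatible)
    Returns the fused tokens (N, C) and the attention weights (N, M).
    """
    q = image_tokens @ w_q    # queries from the input image
    k = prompt_tokens @ w_k   # keys from the task prompt
    v = prompt_tokens @ w_v   # values from the task prompt
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each image token attends over prompt tokens
    fused = image_tokens + weights @ v  # residual connection (an assumption)
    return fused, weights


# Toy example with random features: 16 image tokens, 4 prompt tokens, 8 channels.
rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((16, 8))
prompt_tokens = rng.standard_normal((4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
fused, weights = prompt_cross_attention(image_tokens, prompt_tokens, w_q, w_k, w_v)
```

In this toy setup every image token receives a weighted mix of prompt values, so changing the prompt changes how the same image features are modulated. The actual mechanism in the paper may differ in where it is applied and how the prompt is encoded.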

Paper Structure (18 sections, 3 equations, 10 figures, 8 tables)

Introduction
Related Work
Approach
Representative Low-Level Vision Tasks
Problem Formulation
Low-Level Vision Generalist Model
Experiments and Analysis
Experimental Setup
Quantitative Results
Visual Results
Exploration of Task Prompt
Limitations and Prospects
Conclusion
Exploration on Different Image Restoration Backbone Networks
Exploration on Different Prompt Interaction Mechanisms
...and 3 more sections

Figures (10)

Figure 1: Our proposed low-level vision generalist model, GenLV, can handle diverse tasks with various input/target domains.
Figure 2: Diverse low-level vision tasks. Different categories of tasks differ in terms of target domains. It presents a significant challenge to build a low-level vision generalist model.
Figure 3: Overall approach of our low-level vision generalist model, GenLV.
Figure 4: Comparison of two attention mechanisms.
Figure 5: Visual results of different models on various low-level vision tasks.
...and 5 more figures
