Table of Contents
Fetching ...

Toward a Diffusion-Based Generalist for Dense Vision Tasks

Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari

TL;DR

The paper tackles the challenge of building a single model capable of solving multiple dense vision tasks. It introduces a diffusion-based generalist that treats outputs as conditional RGB images and finetunes pre-trained diffusion models on reformatted task data. To avoid quantization artifacts from latent diffusion, it performs diffusion directly in the pixel space and uses image features from a frozen encoder plus text prompts for conditioning. Experiments across depth estimation, semantic segmentation, panoptic segmentation, and restoration tasks show competitive performance at a modest target resolution of 128×128 and provide a practical training recipe. The work highlights design choices, including image-conditioned conditioning, pre-training transfer, and the superiority of pixel-space diffusion for segmentation-driven tasks, while acknowledging memory costs and suggesting directions for efficiency improvements.

Abstract

Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.

Toward a Diffusion-Based Generalist for Dense Vision Tasks

TL;DR

The paper tackles the challenge of building a single model capable of solving multiple dense vision tasks. It introduces a diffusion-based generalist that treats outputs as conditional RGB images and finetunes pre-trained diffusion models on reformatted task data. To avoid quantization artifacts from latent diffusion, it performs diffusion directly in the pixel space and uses image features from a frozen encoder plus text prompts for conditioning. Experiments across depth estimation, semantic segmentation, panoptic segmentation, and restoration tasks show competitive performance at a modest target resolution of 128×128 and provide a practical training recipe. The work highlights design choices, including image-conditioned conditioning, pre-training transfer, and the superiority of pixel-space diffusion for segmentation-driven tasks, while acknowledging memory costs and suggesting directions for efficiency improvements.

Abstract

Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.
Paper Structure (12 sections, 3 figures, 6 tables)

This paper contains 12 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: We present a diffusion-based vision generalist for dense vision tasks. Given an input image, the model performs the corresponding task following the text instruction. We showcase the effectiveness of our model on depth estimation, semantic segmentation, panoptic segmentation, and three types of image restoration tasks. The images are the actual output of our model.
  • Figure 2: The training pipeline of the diffusion-based vision generalist consists of two parts: Left: Redefining the output space of different vision tasks as RGB images so that they can be unified under a conditional image generation framework. Right: We finetune a pre-trained diffusion model on the reformatted data from the first step. Diffusion is performed in the pixel space to mitigate the quantization error of the latent diffusion (see Table \ref{['tab:latent_vs_pixel']}). The image and text conditionings are fed into the model via the corresponding encoders, where only the image encoder is tuned during the training.
  • Figure 3: Qualitative results on images from the validation sets of ADE20K, MS-COCO, NYU-V2, SIDD, Deraining, and LOL. Following a raster scan order, the text prompts are "Performance semantic segmentation", "Performance instance segmentation", "Performance depth estimation", "Performance image restoration denoising", "Performance image restoration deraining", and "Performance image restoration light enhancement", respectively. The images are not cherry-picked.