Toward a Diffusion-Based Generalist for Dense Vision Tasks
Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari
TL;DR
The paper tackles the challenge of building a single model that solves multiple dense vision tasks. It introduces a diffusion-based generalist that treats task outputs as RGB images to be generated conditioned on the input image, and finetunes pre-trained diffusion models on task data reformatted this way. To avoid the quantization artifacts of latent diffusion, it performs diffusion directly in pixel space and conditions on image features from a frozen encoder together with text prompts. Experiments on depth estimation, semantic segmentation, panoptic segmentation, and image restoration show competitive performance at a modest target resolution of 128×128 and yield a practical training recipe. The work highlights key design choices, including conditioning on image features, transfer from pre-training, and the advantage of pixel-space diffusion for segmentation tasks, while acknowledging memory costs and suggesting directions for efficiency improvements.
Abstract
Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that images themselves can serve as a natural interface for general-purpose visual perception, with inspiring results. In this paper, we explore diffusion-based vision generalists: we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for this purpose. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. We therefore propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show performance competitive with other vision generalists.
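
To make the idea of "dense prediction as conditional image generation" concrete, the sketch below shows one way task labels could be reformatted as RGB targets and decoded back from a generated image. It is an illustration only, not the authors' implementation: the four-color PALETTE, the grayscale depth encoding, and all function names are assumptions introduced here.

```python
import numpy as np

# Hypothetical 4-class color palette (an assumption for illustration;
# the paper's actual class-to-color assignment is not given in this text).
PALETTE = np.array(
    [[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8
)


def labels_to_rgb(labels: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer label map to an (H, W, 3) RGB target image."""
    return PALETTE[labels]


def rgb_to_labels(rgb: np.ndarray) -> np.ndarray:
    """Decode an (H, W, 3) image to labels by nearest palette color."""
    pixels = rgb[..., None, :].astype(np.float32)       # (H, W, 1, 3)
    colors = PALETTE[None, None].astype(np.float32)     # (1, 1, K, 3)
    dist = ((pixels - colors) ** 2).sum(axis=-1)        # (H, W, K)
    return dist.argmin(axis=-1)


def depth_to_rgb(depth: np.ndarray, d_min: float, d_max: float) -> np.ndarray:
    """Encode an (H, W) depth map as a 3-channel grayscale image."""
    norm = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)
    gray = np.round(norm * 255).astype(np.uint8)
    return np.repeat(gray[..., None], 3, axis=-1)


def rgb_to_depth(rgb: np.ndarray, d_min: float, d_max: float) -> np.ndarray:
    """Decode depth by averaging the channels and undoing the normalization."""
    gray = rgb.astype(np.float32).mean(axis=-1) / 255.0
    return d_min + gray * (d_max - d_min)
```

Nearest-color decoding tolerates small per-pixel errors in the generated image, but it depends on colors being reproduced fairly faithfully, which is one reason color distortions in the output (such as the quantization issue the abstract attributes to latent diffusion) matter for segmentation-style targets.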
