Implicit and Explicit Language Guidance for Diffusion-based Visual Perception

Hefeng Wang, Jiale Cao, Jin Xie, Aiping Yang, Yanwei Pang

TL;DR

This paper proposes an implicit and explicit language guidance framework for diffusion-based visual perception, named IEDP, which comprises an implicit language guidance branch and an explicit language guidance branch that can jointly guide feature learning.

Abstract

Text-to-image diffusion models have shown a powerful ability for conditional image synthesis. With large-scale vision-language pre-training, diffusion models can generate high-quality images with rich texture and reasonable structure under different text prompts. However, adapting a pre-trained diffusion model for visual perception remains an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without using explicit text prompts. The explicit branch uses the ground-truth labels of the corresponding images as text prompts to condition the feature extraction of the diffusion model. During training, we jointly train the diffusion model by sharing the model weights between these two branches, so that the implicit and explicit branches jointly guide feature learning. During inference, we employ only the implicit branch for the final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks: semantic segmentation and depth estimation. IEDP achieves promising performance on both tasks. For semantic segmentation, IEDP achieves an mIoU$^\text{ss}$ score of 55.9% on the ADE20K validation set, which outperforms the baseline method VPD by 2.2%. For depth estimation, IEDP outperforms the baseline method VPD with a relative gain of 11.0%.
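
The training and inference asymmetry described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a small linear model stands in for the diffusion backbone and task decoder, and the two encoder functions are hypothetical placeholders rather than CLIP. All names (`SharedModel`, `frozen_image_encoder`, `frozen_text_encoder`, `train_step`, `predict`) are assumptions made for illustration. The sketch only shows that both branches update one shared set of weights during training, while prediction uses the implicit branch alone and needs no labels.

```python
DIM = 3  # hypothetical size of the conditioning embedding


def frozen_image_encoder(image):
    # Placeholder for a frozen image encoder: it has no trainable state.
    mean = sum(image) / len(image)
    return [mean, mean ** 2, 1.0]


def frozen_text_encoder(label_ids):
    # Placeholder for encoding ground-truth labels used as a text prompt.
    return [float(len(label_ids)), sum(label_ids) / 10.0, 1.0]


class SharedModel:
    """Toy stand-in for the shared backbone and decoder (assumption)."""

    def __init__(self):
        self.w = 0.0          # weight on the image feature
        self.v = [0.0] * DIM  # weights on the conditioning embedding

    def forward(self, feat, cond):
        return self.w * feat + sum(vi * ci for vi, ci in zip(self.v, cond))

    def grads(self, feat, cond, target):
        # Gradient of the squared error (pred - target)^2 w.r.t. w and v.
        err = 2.0 * (self.forward(feat, cond) - target)
        return err * feat, [err * ci for ci in cond]


def train_step(model, image, label_ids, target, lr=0.05):
    feat = sum(image) / len(image)
    cond_implicit = frozen_image_encoder(image)     # implicit branch
    cond_explicit = frozen_text_encoder(label_ids)  # explicit branch
    gw_i, gv_i = model.grads(feat, cond_implicit, target)
    gw_e, gv_e = model.grads(feat, cond_explicit, target)
    # Both branch losses update the same (shared) weights.
    model.w -= lr * (gw_i + gw_e)
    model.v = [v - lr * (a + b) for v, a, b in zip(model.v, gv_i, gv_e)]


def predict(model, image):
    # Inference uses only the implicit branch: no labels are required.
    feat = sum(image) / len(image)
    return model.forward(feat, frozen_image_encoder(image))


if __name__ == "__main__":
    model = SharedModel()
    image, labels, target = [0.2, 0.8], [1, 2], 1.0
    for _ in range(200):
        train_step(model, image, labels, target)
    print(predict(model, image))
```

Summing the two branch gradients before the update is one simple way to express joint training with shared weights; how the losses are actually weighted or combined in IEDP is not stated in the abstract.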

Paper Structure (16 sections, 5 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Comparison between existing methods and our proposed method. In (a), the existing methods [zhao2023unleashing, kondapaneni2023textimage] first employ all dataset classes or the BLIP-2 [li2023blip] model to manually or automatically generate text prompts, and then use a frozen CLIP [radford2021learning] text encoder to extract text embeddings, which are fed to stable diffusion to condition feature extraction. These methods adopt the same structure during training and inference. In (b), our proposed method introduces two branches that generate implicit and explicit text embeddings for stable diffusion during training, and these two branches jointly train the model. During inference, we employ only the implicit branch to generate implicit text embeddings for perception tasks.
  • Figure 2: Overall architecture of our proposed method. In (a), we present the overall architecture. We introduce an implicit language guidance branch and an explicit language guidance branch, which respectively use the implicit prompt module and the explicit prompt module to condition the feature extraction of the denoising UNet for the subsequent task-specific decoder. These two branches share model weights during training. During inference, we employ only the implicit branch. In (b) and (c), we give the detailed structures of the implicit prompt module and the explicit prompt module.
  • Figure 3: Visualisation results of semantic segmentation. We provide qualitative segmentation examples of our proposed method on the ADE20K dataset. Our method produces good segmentation results in various scenarios, including indoor, outdoor, and crowded scenes.
  • Figure 4: Visualisation results of depth estimation. We provide qualitative examples of our proposed method on the NYUv2 dataset, including the input images and predicted depth maps.