Text-image Alignment for Diffusion-based Perception

Neehar Kondapaneni; Markus Marks; Manuel Knott; Rogerio Guimaraes; Pietro Perona

TL;DR

This paper introduces Text-Aligned Diffusion Perception (TADP), which uses automated image captions to align text prompts with diffusion-based perception models, significantly improving downstream tasks such as semantic segmentation and monocular depth estimation. By exploring single-domain prompts and cross-domain caption modifiers, the authors demonstrate substantial gains, including state-of-the-art results on ADE20K and NYUv2, and strong cross-domain transfers from VOC to Watercolor2K and Cityscapes to Dark Zurich/Nighttime Driving. Key insights include the superiority of caption-based conditioning over averaged EOS tokens, the importance of target-domain alignment for cross-domain performance, and the value of model personalization via Textual Inversion or DreamBooth. The work showcases the diffusion backbone’s robust generalization and provides practical guidance for prompting strategies, captioners, and personalization to harness diffusion models for discriminative vision tasks across domains.

Abstract

Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current state-of-the-art (SOTA) in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting. We use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our cross-domain object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our cross-domain segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/. Code: https://github.com/damaggu/TADP.
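The caption modification idea for the cross-domain setting can be illustrated with a small sketch. The function name `add_domain_modifier`, the example modifier strings, and the `<watercolor>` placeholder token are hypothetical illustrations rather than anything taken from the released code. In practice the caption would come from an automated captioner, and a personalization token would be learned separately.

```python
def add_domain_modifier(caption: str, modifier: str) -> str:
    """Append a target-domain modifier to a source-domain caption.

    An empty modifier returns the cleaned caption unchanged, which
    corresponds to an unaligned baseline prompt.
    """
    caption = caption.strip().rstrip(".")
    modifier = modifier.strip()
    if not modifier:
        return caption
    return f"{caption} {modifier}"


# Hypothetical modifiers for a watercolor target domain (assumed strings).
MODIFIERS = {
    "none": "",                               # no target-domain alignment
    "text": "in a watercolor style",          # hand-written text modifier
    "learned_token": "in <watercolor> style", # placeholder for a learned token
}

caption = "a bird and a dog"  # example caption quoted in the Figure S1 caption
prompts = {name: add_domain_modifier(caption, m) for name, m in MODIFIERS.items()}

for name, prompt in prompts.items():
    print(f"{name}: {prompt}")
```

Keeping the modifier as a separate string means the same source-domain captions can be reused for several target domains by swapping only the suffix.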

Paper Structure (19 sections, 5 equations, 24 figures, 9 tables)

Introduction
Related Work
Diffusion models for single-domain vision tasks
Image captioning
Diffusion models for cross-domain vision tasks
Cross-domain object detection
Methods
Text-Aligned Diffusion Perception (TADP)
Results
Latent scaling
Single-domain alignment
Cross-domain alignment
Discussion
Acknowledgements
Cross-attention analysis
...and 4 more sections

Figures (24)

Figure 1: Text-Aligned Diffusion Perception (TADP). In TADP, image captions align the text prompts and images passed to diffusion-based vision models. In cross-domain tasks, target domain information is incorporated into the prompt to boost performance.
Figure 2: Overview of TADP. We test several prompting strategies and evaluate their impact on downstream vision task performance. Our method concatenates the cross-attention and multi-scale feature maps before passing them to the vision-specific decoder. In the blue box, we show three single-domain captioning strategies with differing levels of text-image alignment. We propose using BLIP (Li et al., 2023) captioning to improve image-text alignment. We extend our analysis to the cross-domain setting (yellow box), exploring whether aligning the source-domain captions to the target domain affects model performance. To do so, we append caption modifiers to image captions generated in the source domain, and find that model personalization modifiers (Textual Inversion/DreamBooth) work best.
Figure 3: Effects of Latent Scaling (LS) and BLIP caption minimum length. We report mIoU for Pascal, mIoU for ADE20K, and RMSE for NYUv2 depth (right). (Top) Latent scaling improves performance by ${\sim}0.8$ mIoU on Pascal (higher is better), ${\sim}0.3$ mIoU on ADE20K, and ${\sim}5.5\%$ relative RMSE on NYUv2 (lower is better). (Bottom) We see a similar effect for BLIP minimum token length, with longer captions performing better: ${\sim}0.8$ mIoU on Pascal, ${\sim}0.9$ mIoU on ADE20K, and ${\sim}0.6\%$ relative RMSE on NYUv2.
Figure 4: Cross-attention maps for different types of prompting (before training). We compare the cross-attention maps for four types of prompting: oracle, BLIP, Average EOS tokens, and class names as space-separated strings. The cross-attention maps for different heads at all scales are upsampled to $64 \times 64$ and averaged. When comparing Average Template EOS and Class Names, we see (qualitatively) that averaging degrades the quality of the cross-attention maps. Furthermore, we find that class names that are not present in the image can have highly localized attention maps (e.g., 'bottle'). Further analysis of the cross-attention maps is available in the appendix (Cross-attention analysis), where we explore image-to-image generation, copy-paste image modifications, and more.
Figure S1: Qualitative image-to-image variation. An untrained Stable Diffusion model is passed an image to perform image-to-image variation. The number of denoising steps increases from left to right (5 to 45 out of a total of 50). In the top row, we pass all the class names in Pascal VOC 2012: "background airplane bicycle bird boat bottle bus car cat chair cow dining table dog horse motorcycle person potted plant sheep sofa train television". In the bottom row, we pass the BLIP caption "a bird and a dog".
...and 19 more figures
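
The aggregation described in the Figure 4 caption (upsample each cross-attention map to a common resolution, then average) can be sketched in plain Python. This is a minimal sketch under stated assumptions: nearest-neighbour upsampling, square maps whose side divides the target size evenly, and tiny toy grids standing in for real attention tensors. The helper names `upsample_nearest` and `average_maps` are hypothetical, and the interpolation method actually used by the authors is not stated in the caption.

```python
from typing import List

Grid = List[List[float]]


def upsample_nearest(grid: Grid, size: int) -> Grid:
    """Nearest-neighbour upsample of a square grid to size x size."""
    side = len(grid)
    if size % side != 0:
        raise ValueError("target size must be a multiple of the grid side")
    factor = size // side
    return [
        [grid[r // factor][c // factor] for c in range(size)]
        for r in range(size)
    ]


def average_maps(maps: List[Grid], size: int = 64) -> Grid:
    """Upsample every map to size x size, then average element-wise."""
    if not maps:
        raise ValueError("need at least one map")
    upsampled = [upsample_nearest(m, size) for m in maps]
    n = len(upsampled)
    return [
        [sum(u[r][c] for u in upsampled) / n for c in range(size)]
        for r in range(size)
    ]


# Toy stand-ins for maps at two scales (2x2 and 4x4), combined at 4x4.
coarse = [[1.0, 0.0], [0.0, 1.0]]
fine = [[0.0] * 4 for _ in range(4)]
fine[0][0] = 1.0

combined = average_maps([coarse, fine], size=4)
for row in combined:
    print(row)
```

With real models, the per-head maps at each scale would typically be stacked as tensors and resized with a library interpolation routine; the list-based version here only shows the order of operations.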
