Making Training-Free Diffusion Segmentors Scale with the Generative Power

Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang

TL;DR

Two techniques are proposed: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability, and are evaluated on standard semantic segmentation benchmarks.

Abstract

As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there is a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques, auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at https://github.com/Darkbblue/goca.

Paper Structure (29 sections, 13 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: (a) Previous training-free diffusion segmentors scale poorly with the generative power of diffusion models, which inspires our study to enable such scaling. (b) We have identified two gaps from individual cross-attention maps to semantic correlation, which have been preventing the aforementioned scaling.
  • Figure 2: Attention maps in different heads and layers show a certain collaboration pattern, each focusing on distinct aspects of the image.
  • Figure 3: Overview of our method. (a) and (b) constitute the auto aggregation part, and (c) is the per-pixel rescaling part. (a) Head-wise aggregation reformulates multi-head attention as vector summation and then uses the dot-product similarity between each vector and the summed vector as head weights. (b) Layer-wise aggregation computes pseudo self-attention based on a chosen dense feature, regards it as the pseudo global attention, and finally uses the similarity between per-layer self-attention maps and this pseudo global attention as layer weights. (c) We exclude semantic special tokens and stop word tokens, considering only content word tokens, and rescale their attention scores to sum to 1 at each pixel, followed by a conventional per-token re-normalization.
  • Figure 4: Illustration of imbalance phenomena in raw global attention scores.
  • Figure 5: Visualization of per-class attention maps and the final segmentation using Vanilla and Ours. The input image is as shown in the top-left corner, and the prompt is "a cat on grass before wall".
  • ...and 7 more figures
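
The Figure 3 caption describes head-wise aggregation (a) and per-pixel rescaling (c) only in words, so a small NumPy sketch may help make the data flow concrete. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). All function names and array shapes are hypothetical. The softmax used to turn dot-product similarities into head weights, the choice of one weight per head and pixel, and the min-max form of the per-token re-normalization are assumptions, since the caption does not specify them.

```python
import numpy as np


def head_weights(contrib):
    """Head weights from dot-product similarity with the summed vector.

    contrib: (H, P, D) array, the output vector each of H heads contributes
    at each of P pixels (multi-head attention viewed as a sum over heads).
    Returns (H, P) weights. Softmax over heads is an ASSUMPTION.
    """
    summed = contrib.sum(axis=0)                      # (P, D) summed vector
    sims = np.einsum("hpd,pd->hp", contrib, summed)   # dot product per head and pixel
    sims = sims - sims.max(axis=0, keepdims=True)     # numerical stability
    exp = np.exp(sims)
    return exp / exp.sum(axis=0, keepdims=True)


def aggregate_heads(attn, weights):
    """Weighted sum of per-head cross-attention maps.

    attn: (H, P, T) attention from P pixels to T text tokens, per head.
    weights: (H, P) head weights. Returns a (P, T) aggregated map.
    """
    return np.einsum("hp,hpt->pt", weights, attn)


def rescale_per_pixel(attn, content_mask, eps=1e-8):
    """Keep only content-word tokens and rescale each pixel's scores to sum to 1.

    attn: (P, T) non-negative attention map.
    content_mask: (T,) boolean, False for special tokens and stop words.
    Returns a (P, C) map, where C is the number of content tokens.
    """
    content = attn[:, content_mask]
    return content / (content.sum(axis=1, keepdims=True) + eps)


def renormalize_per_token(attn, eps=1e-8):
    """Per-token re-normalization over pixels (min-max form is an ASSUMPTION)."""
    lo = attn.min(axis=0, keepdims=True)
    hi = attn.max(axis=0, keepdims=True)
    return (attn - lo) / (hi - lo + eps)
```

In this sketch, a segmentation label for each pixel could then be read off as the argmax over the content-token axis of the re-normalized map. Layer-wise aggregation (b) is left out because the caption gives too little detail about the chosen dense feature to sketch it without guessing.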