RefAM: Attention Magnets for Zero-Shot Referral Segmentation

Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele

TL;DR

RefAM tackles zero-shot referring segmentation by exploiting cross-attention features from diffusion transformers without any fine-tuning. The key idea is to identify global attention sinks (GAS) and to use attention magnets (appended stop words and color terms) to redistribute background attention, yielding sharper and more localized grounding maps. This training-free framework combines cross-attention aggregation, GAS handling, and redistribution to achieve state-of-the-art performance on multiple image and video referring segmentation benchmarks. The approach demonstrates that diffusion-based features can serve as a general foundation for grounding referring expressions, without architectural changes or additional supervision.

Abstract

Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, specifically attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach achieves strong performance and surpasses prior methods on most datasets, establishing a new state of the art without fine-tuning, additional components, or complex reasoning.
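The filtering idea in the abstract can be illustrated with a minimal sketch. It assumes one cross-attention heatmap per text token is already available; the stop-word list, the function names, and the choice of a plain average for aggregation are hypothetical and are not taken from the paper. Appended magnet tokens are assumed to occupy the last positions of the token sequence.

```python
from typing import List, Set, Tuple

Heatmap = List[List[float]]  # H x W attention map for one text token

# Hypothetical stop-word list; the abstract does not give the exact set.
STOP_WORDS: Set[str] = {"a", "an", "the", "of", "on", "in", "with", "and", "to", "is"}


def aggregate_content_maps(token_maps: List[Tuple[str, Heatmap]],
                           num_appended: int) -> Heatmap:
    """Average the maps of content tokens only.

    The last `num_appended` entries are treated as appended attention magnets
    and dropped; stop words inside the expression are dropped as well.
    Averaging is an assumption made for this sketch.
    """
    if not 0 <= num_appended <= len(token_maps):
        raise ValueError("num_appended out of range")
    expression = token_maps[: len(token_maps) - num_appended]
    kept = [m for tok, m in expression if tok.lower() not in STOP_WORDS]
    if not kept:
        raise ValueError("no content tokens left after filtering")
    h, w = len(kept[0]), len(kept[0][0])
    return [[sum(m[i][j] for m in kept) / len(kept) for j in range(w)]
            for i in range(h)]


def argmax_location(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (row, col) of the strongest response in the heatmap."""
    cells = ((i, j) for i in range(len(heatmap)) for j in range(len(heatmap[0])))
    return max(cells, key=lambda ij: heatmap[ij[0]][ij[1]])


# Toy 2x2 maps for "the dog on grass" followed by two appended magnets.
token_maps = [
    ("the",   [[0.3, 0.3], [0.3, 0.3]]),
    ("dog",   [[0.1, 0.8], [0.0, 0.1]]),
    ("on",    [[0.2, 0.3], [0.3, 0.2]]),
    ("grass", [[0.0, 0.2], [0.4, 0.4]]),
    ("with",  [[0.4, 0.1], [0.4, 0.1]]),  # appended magnet
    ("a",     [[0.1, 0.4], [0.1, 0.4]]),  # appended magnet
]
fused = aggregate_content_maps(token_maps, num_appended=2)
point = argmax_location(fused)
```

In this toy input only the maps for "dog" and "grass" survive filtering, so the diffuse responses carried by stop words and magnets do not dilute the fused map. The resulting point could then be passed to a promptable segmenter as a point prompt.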

Paper Structure

This paper contains 22 sections, 4 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Global Attention Sinks (GAS) in DiT. We highlight tokens (here, tokens #1 and #16) that act as GAS in late layers. These tokens allocate disproportionately high and nearly uniform attention across all text and image tokens simultaneously. GAS are absent in early layers, emerge consistently in deeper blocks, and serve as indicators of semantic structure. While uninformative themselves, they can suppress useful signals when they occur on meaningful tokens.
  • Figure 2: Pipeline overview. We first extract cross-attention maps for the referring expression with attention magnets. Next, we filter out stop words and attention magnets, aggregate the remaining maps, identify the argmax location, and apply SAM to generate the final segmentation mask.
  • Figure 3: Emergence of semantic information in DiT. Top: text-to-text attention across layers. Early layers (0–19) are diffuse and uniform, while middle and late layers (20–47) develop block-diagonal structure, indicating meaningful linguistic grouping. Bottom: text-to-image attention for the “_patches” token. Early layers spread attention broadly over the scene, whereas middle layers begin to localize, and late layers sharpen around the target object. These dynamics illustrate how semantic alignment emerges progressively with depth.
  • Figure 4: Entropy across transformer blocks. Blocks 0–25 contain no specific information.
  • Figure 5: Influence of attention magnets on RVOS. Examples demonstrating the impact of filtering attention magnets.
  • ...and 9 more figures
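
The captions of Figures 1 and 4 refer to two measurable quantities: the entropy of an attention distribution, and tokens that collect disproportionately high, nearly uniform attention. The sketch below shows one way these could be computed. The function names, the thresholds, and the reading of "GAS-like" as a column of the attention matrix with high mean and low spread are assumptions made for illustration, not the paper's definitions.

```python
import math
from typing import List


def attention_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def gas_like_tokens(attn: List[List[float]],
                    mass_ratio: float = 3.0,
                    max_cv: float = 0.1) -> List[int]:
    """Flag token indices k that look like attention sinks.

    attn[q][k] is the attention from query q to token k. A token is flagged
    when its mean received attention is at least `mass_ratio` times the mean
    over all tokens, and its column is nearly uniform (coefficient of
    variation at most `max_cv`). Both thresholds are hypothetical.
    """
    num_q, num_k = len(attn), len(attn[0])
    col_means = [sum(row[k] for row in attn) / num_q for k in range(num_k)]
    overall = sum(col_means) / num_k
    flagged = []
    for k, mean_k in enumerate(col_means):
        if mean_k <= 0 or mean_k < mass_ratio * overall:
            continue
        var = sum((row[k] - mean_k) ** 2 for row in attn) / num_q
        if math.sqrt(var) / mean_k <= max_cv:
            flagged.append(k)
    return flagged


# Toy attention matrix: token 1 receives most attention from every query.
attn = [
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.78, 0.02, 0.10],
    [0.02, 0.82, 0.06, 0.10],
]
sinks = gas_like_tokens(attn)
uniform_entropy = attention_entropy([1.0, 1.0, 1.0, 1.0])
peaked_entropy = attention_entropy([1.0, 0.0, 0.0, 0.0])
```

A uniform distribution gives the maximum entropy for its length and a one-hot distribution gives zero, so a per-block entropy curve of this kind separates diffuse blocks from localized ones.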