TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, Kyungsu Kim

TL;DR

TRACE shows that text-to-image diffusion models secretly encode instance boundary priors in their self-attention maps during denoising. It identifies the Instance Emergence Point via KL divergence, converts self-attention signals into edges with Attention Boundary Divergence, and distills these cues into a fast one-step edge decoder. The resulting instance edges improve unsupervised and weakly-supervised segmentation and serve as high-quality seeds for open-vocabulary systems like SAM, all without per-image inversion or instance annotations. This yields sharp, connected boundaries and scalable, annotation-free panoptic perception across diverse datasets and backbones, with 81x faster inference and gains of +5.1 AP in unsupervised instance segmentation and +1.7 PQ in tag-supervised panoptic segmentation on COCO.

Abstract

High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain limited by their semantic backbones and by human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP), where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81x faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Code is available at https://github.com/shjo-april/DiffEGG.
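The abstract describes the IEP only at a high level, so the following is a minimal sketch of one way a KL-based emergence step could be located, not the authors' implementation. It assumes each step's self-attention map is a list of rows (one probability distribution per query token) and scores each step by the mean row-wise KL divergence from the previous step; the function names `kl_divergence`, `mean_row_kl`, and `find_emergence_step`, the direction of the KL term, and the "largest change wins" rule are all illustrative assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_row_kl(attn_a, attn_b):
    """Average KL over the rows (query tokens) of two self-attention maps."""
    total = sum(kl_divergence(ra, rb) for ra, rb in zip(attn_a, attn_b))
    return total / len(attn_a)


def find_emergence_step(attn_by_step):
    """Return (step index, scores), where the step has the largest KL from its predecessor.

    attn_by_step[i] is the self-attention map at the i-th step of the sequence.
    scores[i - 1] holds the divergence between step i and step i - 1.
    """
    scores = [
        mean_row_kl(attn_by_step[i], attn_by_step[i - 1])
        for i in range(1, len(attn_by_step))
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores


# Toy example (hypothetical data): attention is uniform for two steps,
# then becomes peaked, mimicking structure "emerging" at step index 2.
uniform = [[0.5, 0.5], [0.5, 0.5]]
peaked = [[0.9, 0.1], [0.1, 0.9]]
step, scores = find_emergence_step([uniform, uniform, peaked, peaked])
print(step, scores)
```

In this toy sequence the only nonzero score is the transition from uniform to peaked attention, so that transition is selected as the peak.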

Paper Structure

This paper contains 38 sections, 23 equations, 21 figures, 16 tables, 4 algorithms.

Figures (21)

  • Figure 1: Emergence and extraction of instance cues in diffusion attention. (a) In the reverse process, cross-attention remains semantic even with the prompt, whereas self-attention at specific steps reveals instance-level structure. (b) TRACE selects the instance-emergent step using a temporal divergence criterion, extracts non-parametric edges from self-attention differences, and refines the instance boundaries via one-step distillation with the diffusion backbone.
  • Figure 2: Effect of TRACE. (Left) Our instance edges, decoded from diffusion self-attention, reconnect fragmented masks and separate adjacent objects; white dotted circles mark corrected boundaries. (Right) Consistent AP$_{mk}$ gains over baselines [wang2023cut, li2024promerge] without instance-level annotations.
  • Figure 3: Example of human bias in COCO.
  • Figure 4: Overview of TRACE. (a) The diffusion forward process locates the instance emergence point (IEP) $t^\star$ via a KL peak and extracts the instance-aware attention $SA(X_{t^\star})$; ABDiv converts it into a pseudo edge map $E$. (b) One-step self-distillation at $t{=}0$ trains an edge decoder $\mathcal{G}_\phi$ with $E$, masking uncertain pixels. Training from $E$ closes gaps in fragmented edges (green circles) and yields connected boundaries $\hat{E}$. At inference, TRACE predicts $\hat{E}$ in a single pass without IEP or ABDiv.
  • Figure 5: Illustration of Attention Boundary Divergence (ABDiv). Boundary regions (a) exhibit sharp attention divergence between opposite neighbors, whereas interior regions (b) remain stable, producing much smaller ABDiv values.
  • ...and 16 more figures
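
The Figure 5 caption describes ABDiv as the divergence between the attention of opposite neighbors, high at boundaries and low in object interiors. The sketch below illustrates that idea under stated assumptions and is not the paper's exact formula: `attn[y][x]` is assumed to be pixel (y, x)'s attention distribution over all flattened pixel positions, the divergence is assumed to be a symmetric KL, and the horizontal and vertical neighbor pairs are simply summed. The names `sym_kl` and `abdiv_map` are hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sym_kl(p, q):
    """Symmetric KL divergence (an assumed choice of divergence)."""
    return kl(p, q) + kl(q, p)


def abdiv_map(attn):
    """Score each pixel by how much its opposite neighbors' attention differs.

    attn[y][x] is the attention distribution of pixel (y, x).
    Pixels on the image border lack one or both neighbor pairs; missing pairs add 0.
    """
    height, width = len(attn), len(attn[0])
    edges = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            score = 0.0
            if 0 < x < width - 1:  # left vs. right neighbor
                score += sym_kl(attn[y][x - 1], attn[y][x + 1])
            if 0 < y < height - 1:  # upper vs. lower neighbor
                score += sym_kl(attn[y - 1][x], attn[y + 1][x])
            edges[y][x] = score
    return edges


# Toy example (hypothetical data): a 1x6 strip with two "objects".
# Pixels 0-2 attend only to pixels 0-2; pixels 3-5 attend only to pixels 3-5.
a = [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 1 / 3, 1 / 3, 1 / 3]
strip = [[a, a, a, b, b, b]]
print(abdiv_map(strip)[0])
```

In the toy strip, the two pixels adjacent to the object boundary (x = 2 and x = 3) receive large scores because their left and right neighbors attend to different objects, while interior pixels score zero.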