ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim

TL;DR

ISAC addresses the core challenge of reliable multi-object generation in diffusion models by enforcing an instance-first generation process. It is a training-free, model-agnostic framework that first forms disjoint instance layouts from self-attention and then binds semantics to these instances via a cross-attention–driven, repel-and-bind objective with a timestepped loss schedule. The two-phase approach yields substantial improvements in multi-object counting and intra-category composition across text-to-image and layout-to-image settings, outperforming prior training-free methods and matching or exceeding some count-supervised approaches without additional training. This instance-centric decoupling enhances robustness and controllability in complex scenes, with strong practical implications for applications requiring precise object counts and distinct object semantics.
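The timestepped loss schedule mentioned above can be pictured as a weight that moves from the instance objective to the semantic objective as denoising proceeds. The sketch below is a minimal illustration only: the linear shape, the function names `phase_weights` and `combined_loss`, and the scalar loss inputs are assumptions, since this summary does not state the schedule ISAC actually uses.

```python
def phase_weights(step: int, num_steps: int) -> tuple[float, float]:
    """Return (w_instance, w_semantic) for a denoising step (0 = first step).

    Assumption: a linear hand-off from the instance loss to the semantic loss.
    The schedule used by ISAC itself is not specified in this summary.
    """
    if num_steps < 1 or not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    progress = step / max(num_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    return 1.0 - progress, progress


def combined_loss(instance_loss: float, semantic_loss: float,
                  step: int, num_steps: int) -> float:
    """Blend the two phase losses with the step-dependent weights."""
    w_instance, w_semantic = phase_weights(step, num_steps)
    return w_instance * instance_loss + w_semantic * semantic_loss
```

With this hypothetical schedule, early steps are driven almost entirely by the instance-separation term and late steps by the semantic binding term, which mirrors the instance-first ordering described in the TL;DR.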

Abstract

Text-to-image diffusion models have recently become highly capable, yet their behavior in multi-object scenes remains unreliable: models often produce an incorrect number of instances and let semantics leak across objects. We trace these failures to vague instance boundaries; self-attention already reveals instance layouts early in the denoising process, but existing approaches act only on semantic signals. We introduce ISAC (Instance-to-Semantic Attention Control), a training-free, model-agnostic objective that performs hierarchical attention control by first carving out instance layouts from self-attention and then binding semantics to these instances. In Phase 1, ISAC clusters self-attention into as many groups as there are instances and repels overlaps, establishing an instance-level structural hierarchy; in Phase 2, it injects these instance cues into cross-attention to obtain instance-aware semantic masks and separates mixed semantics by tying attributes within each instance. ISAC yields consistent gains on T2I-CompBench, HRS-Bench, and IntraCompBench, our new benchmark for intra-class compositions where failures are most frequent, with improvements of at least 50% in multi-class accuracy and 7% in multi-instance accuracy on IntraCompBench, without any fine-tuning or external models. Beyond text-to-image setups, ISAC also strengthens layout-to-image controllers under overlapping boxes by refining coarse box layouts into dense instance masks, indicating that hierarchical decoupling of instance formation and semantic assignment is a key principle for robust, controllable multi-object generation. Code will be released upon publication.
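To make the "repels overlaps" step of Phase 1 concrete, the following sketch scores how much a set of soft instance masks overlap. It is an illustration under stated assumptions, not the paper's loss: the function name `overlap_penalty`, the use of a mean pairwise product, and the `(N, H, W)` mask layout are all hypothetical, and the clustering step that would produce the masks is left out.

```python
import numpy as np


def overlap_penalty(masks: np.ndarray) -> float:
    """Mean pairwise product of N soft masks with shape (N, H, W).

    Assumption: overlap is measured as the per-pixel product of two masks,
    averaged over pixels and over ordered mask pairs. The value is 0 when
    the masks are disjoint and 1 when all masks are identically 1.
    """
    n = masks.shape[0]
    if n < 2:
        return 0.0  # a single instance cannot overlap with anything
    flat = masks.reshape(n, -1).astype(float)
    gram = flat @ flat.T                      # (N, N) pairwise overlap sums
    off_diagonal = gram.sum() - np.trace(gram)  # drop each mask's self-overlap
    num_pairs = n * (n - 1)                   # ordered pairs i != j
    return float(off_diagonal / (num_pairs * flat.shape[1]))
```

Minimizing a quantity like this pushes the masks toward disjoint regions, which is the behavior the abstract attributes to Phase 1.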

Paper Structure

This paper contains 56 sections, 12 equations, 22 figures, 16 tables, 2 algorithms.

Figures (22)

  • Figure 1: Importance of instance-level control for multi-object generation. Existing text-to-image diffusion models (e.g., SD1.5 [rombach2022high]) and prior training-free guidance methods [guo2024initno, qiu2025self] still suffer from count failures (missing or merged instances) and semantic mixing (attributes spilling across objects), whereas ISAC’s instance-first design yields the correct number of instances with clearly separated semantics.
  • Figure 2: Semantic overlap across object pairs. We measure semantic mixing by the Dice coefficient between the two instance-aware semantic masks in \ref{eq:propagate} for prompts of the form "A photo of a <object1> and a <object2>" with SD3.5-M [esser2024scaling]. Left: heatmap over object pairs. Within-supercategory pairs (fruits, vehicles, animals; blue boxes) show consistently higher overlap, revealing that semantic masks tend to cover multiple similar objects at once. Right: qualitative examples. We visualize the signed difference between the two masks, normalized by their summed attention strengths. Color intensity reflects the strength of dominance.
  • Figure 3: Dynamics of text-to-image diffusion models. In early diffusion steps, instance structures actively emerge while semantics remain underdeveloped. In later diffusion steps, instance structures stabilize and semantics are refined. Because detection models (e.g., [liu2024grounding]) rely on strong semantic cues, they are effective only in later steps. We use the prompt "A photo of a cat and a dog" with SD3.5-M [esser2024scaling].
  • Figure 4: Overview of ISAC. Given a multi-object prompt, ISAC steers diffusion in two phases. In Phase 1 (\ref{sec:phase_1}), we cluster the self-attention map into $N$ class-agnostic instance masks and apply an instance separation loss that repels overlaps, yielding clean instance layouts early in the trajectory. In Phase 2 (\ref{sec:phase_1}), the self-attention map with reliable instance structures is injected into cross-attention (CA) to produce instance-aware semantic masks, and a repel-and-bind loss pushes apart incompatible tokens while binding attributes within each instance so that semantics follow instance shapes. An instance-to-semantic schedule gradually shifts weight from Phase 1 to Phase 2 across timesteps, aligning ISAC’s control with the diffusion dynamics (see \ref{sec:isac_loss_schedule}).
  • Figure 5: Qualitative comparison of attention control methods applied on top of the SD1.5 [rombach2022high] and SD3.5-M [esser2024scaling] backbones.
  • ...and 17 more figures
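
The overlap measure named in the Figure 2 caption is the standard Dice coefficient, which the sketch below computes for two masks. The helper names and the small `eps` stabilizer are illustrative choices. The second function gives one possible reading of the caption's "signed difference ... normalized by their summed attention strengths", namely a per-pixel normalization; that reading is an assumption, since the caption does not say whether the sum is taken per pixel or over the whole map.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b, eps: float = 1e-8) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|), generalized to soft masks via products."""
    a = np.asarray(mask_a, dtype=float)
    b = np.asarray(mask_b, dtype=float)
    intersection = (a * b).sum()
    return float(2.0 * intersection / (a.sum() + b.sum() + eps))


def signed_dominance(mask_a, mask_b, eps: float = 1e-8) -> np.ndarray:
    """Per-pixel (a - b) / (a + b): positive where mask A dominates, negative for B.

    Assumption: normalization is per pixel; the caption may intend a global sum.
    """
    a = np.asarray(mask_a, dtype=float)
    b = np.asarray(mask_b, dtype=float)
    return (a - b) / (a + b + eps)
```

A Dice value near 1 means the two semantic masks cover almost the same region (strong semantic mixing), while a value near 0 means they are nearly disjoint.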