SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

Xiaoyan Zhang, Zechen Bai, Haofan Wang, Yiren Song

TL;DR

SIGMA tackles the binding problem in multi-reference image generation, that is, tying each requested attribute to the reference image it should come from. It extends a unified diffusion transformer with interleaved multi-condition inputs and selective multi-attribute tokens. The framework has four parts: post-training of the unified Bagel backbone, a vocabulary of multi-attribute tokens, an interleaved text–image conditioning scheme, and a group-scoped attention mask that prevents cross-condition leakage. Trained on a 700K-example interleaved dataset spanning six task families, SIGMA improves controllability, compositionality, and visual fidelity over Bagel and approaches GPT-4o and Nano-Banana on several benchmarks. The method is described as model-agnostic and provides a practical foundation for structured, multi-source conditioning in diffusion-based generative systems, enabling flexible editing and synthesis without architectural retraining.

Abstract

Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
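
The interleaved text-image sequence with attribute tokens described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the token spellings (`<style>` and so on), the `Segment` type, the `build_interleaved` helper, and the choice to place each attribute token directly after its image placeholder are all hypothetical. Only the four attribute names come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical token spellings: the abstract names the four attribute types
# but does not say how the tokens are written.
ATTRIBUTE_TOKENS = {
    "style": "<style>",
    "content": "<content>",
    "subject": "<subject>",
    "identity": "<identity>",
}


@dataclass
class Segment:
    kind: str   # "text", "image", or "attr"
    value: str  # text span, image placeholder id, or attribute token
    group: int  # 0 for shared text, k >= 1 for the k-th reference image


def build_interleaved(parts: List[Tuple[str, ...]]) -> List[Segment]:
    """Turn ("text", span) and ("ref", attribute, image_id) parts into one sequence.

    Each reference contributes an image placeholder followed by its attribute
    token, and both share a group id (the ordering is an assumption).
    """
    segments: List[Segment] = []
    next_group = 1
    for part in parts:
        if part[0] == "text":
            segments.append(Segment("text", part[1], 0))
        elif part[0] == "ref":
            _, attribute, image_id = part
            token = ATTRIBUTE_TOKENS[attribute]  # KeyError for unknown attributes
            segments.append(Segment("image", image_id, next_group))
            segments.append(Segment("attr", token, next_group))
            next_group += 1
        else:
            raise ValueError(f"unknown part type: {part[0]!r}")
    return segments


sequence = build_interleaved([
    ("text", "A portrait of the person in"),
    ("ref", "identity", "img_0"),
    ("text", "painted in the style of"),
    ("ref", "style", "img_1"),
])
for seg in sequence:
    print(seg.group, seg.kind, seg.value)
```

The group id is what lets a later masking step tell which attribute token belongs to which reference image.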

Paper Structure (39 sections, 9 equations, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Overview of tasks within our unified framework, covering diverse generation scenarios including compositional generation, selective generation, stylization, style relation transfer, and layout-based generation.
  • Figure 2: Overview of SIGMA. Left: Each sample may include multiple images, and each image can involve multiple visual attributes. Text–Image Interleave aligns textual spans with image placeholders, while Special Token Adding binds specific attributes to their corresponding images, avoiding semantic entanglement. Right: On top of causal attention, we add group-scoped masks so each special token can only attend to images within its own group, effectively reducing cross-image interference and ensuring clean attribute–image binding.
  • Figure 3: Overview and construction of the interleaved multi-condition dataset. (a) The 700K corpus spans six task families: compositional generation, selective content extraction, stylization, relation transfer, image editing, and conditional layout. (b) Data are built via compositional generation (GPT-4o + Nano-Banana), selective content extraction (object matching with GPT-4o), and token injection, which binds text entities to reference images for interleaved supervision.
  • Figure 4: Qualitative results achieved by SIGMA. By leveraging specialized attribute tokens, SIGMA flexibly binds the required elements from input references, accomplishing a wide range of generation tasks. The results demonstrate clear binding between inputs and outputs, as well as high-quality, visually coherent generations across all scenarios.
  • Figure 5: Qualitative comparisons on our benchmark.
  • ...and 10 more figures
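
The group-scoped mask from the Figure 2 caption can be sketched as a boolean matrix layered on causal attention. This is a minimal illustration, not the paper's code: the function name `group_scoped_mask`, the per-token `kinds` and `groups` labels, and the toy sequence are hypothetical. The caption only states that a special token may attend to images within its own group, so leaving every other causal connection open (text to images, special token to text or to earlier special tokens) is an assumption.

```python
from typing import List


def group_scoped_mask(kinds: List[str], groups: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Rule 1 (causal): a position only sees itself and earlier positions.
    Rule 2 (group scope): an attribute ("attr") token never sees image tokens
    that belong to a different group.
    Everything else allowed by rule 1 stays visible (an assumption).
    """
    if len(kinds) != len(groups):
        raise ValueError("kinds and groups must have the same length")
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: j <= i
            cross_group_image = (
                kinds[i] == "attr"
                and kinds[j] == "image"
                and groups[j] != groups[i]
            )
            mask[i][j] = not cross_group_image
    return mask


# Toy sequence: two image tokens and one attribute token per reference,
# with one shared text token in between (group 0).
kinds = ["image", "image", "attr", "text", "image", "image", "attr"]
groups = [1, 1, 1, 0, 2, 2, 2]

mask = group_scoped_mask(kinds, groups)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the last row (the group-2 attribute token) has dots in the first two columns: it cannot see the group-1 image tokens, while its own image tokens remain visible.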