SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens
Xiaoyan Zhang, Zechen Bai, Haofan Wang, Yiren Song
TL;DR
SIGMA tackles the binding problem in multi-reference image generation by extending a unified diffusion transformer with interleaved multi-condition inputs and selective multi-attribute tokens. Its framework has four parts: post-training of the unified Bagel backbone, a vocabulary of multi-attribute tokens, an interleaved text–image conditioning scheme, and a group-scoped attention mask that prevents cross-condition leakage. Trained on a 700K-example interleaved dataset spanning six task families, SIGMA improves controllability, compositionality, and visual fidelity, outperforming Bagel and approaching the capabilities of GPT-4o and Nano-Banana on several benchmarks. The method is model-agnostic and provides a practical foundation for structured, multi-source conditioning in diffusion-based generative systems, enabling flexible editing and synthesis without architectural retraining.
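The summary above does not spell out how the group-scoped attention mask is built, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes every token carries an integer group id, that a hypothetical `SHARED` group (id 0) can see and be seen by all tokens, and that tokens of different condition groups never attend to each other. The function name `group_scoped_mask` and the group layout are illustrative, not taken from SIGMA.

```python
from typing import List

# Assumption: id 0 marks tokens (e.g. the text prompt) that every group may
# attend to and that may attend to every group. SIGMA's real rule may differ.
SHARED = 0


def group_scoped_mask(group_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] = True when query token i may attend to key token j."""
    n = len(group_ids)
    mask = [[False] * n for _ in range(n)]
    for i, gi in enumerate(group_ids):
        for j, gj in enumerate(group_ids):
            # Allow attention inside a group and to/from the shared group;
            # block attention between two different condition groups.
            mask[i][j] = gi == SHARED or gj == SHARED or gi == gj
    return mask


# Two shared tokens, two tokens of condition group 1, two of condition group 2.
group_ids = [0, 0, 1, 1, 2, 2]
mask = group_scoped_mask(group_ids)
```

In this toy layout, tokens of condition group 1 cannot attend to tokens of condition group 2, which is one simple way to read "preventing cross-condition leakage". A boolean mask of this shape could be passed to an attention implementation that accepts one.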
Abstract
Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.
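To make the notion of an interleaved text-image sequence with attribute tokens concrete, here is a small illustrative sketch. The literal token spellings (`<style>`, `<content>`, `<subject>`, `<identity>`), the `ImageCondition` class, the `<image:...>` placeholder, and `build_interleaved` are all hypothetical; the abstract names the four attribute types but not how they are encoded.

```python
from dataclasses import dataclass
from typing import List, Union

# Assumption: one special token per attribute type named in the abstract.
ATTRIBUTE_TOKENS = {
    "style": "<style>",
    "content": "<content>",
    "subject": "<subject>",
    "identity": "<identity>",
}


@dataclass
class ImageCondition:
    attribute: str   # which attribute to take from the image
    image_ref: str   # stand-in for the image (a real model would use latents)


def build_interleaved(segments: List[Union[str, ImageCondition]]) -> List[str]:
    """Flatten text and image conditions into one ordered token-like sequence."""
    sequence: List[str] = []
    for seg in segments:
        if isinstance(seg, str):
            sequence.append(seg)
        else:
            if seg.attribute not in ATTRIBUTE_TOKENS:
                raise ValueError(f"unknown attribute: {seg.attribute}")
            # The attribute token precedes the image it qualifies.
            sequence.append(ATTRIBUTE_TOKENS[seg.attribute])
            sequence.append(f"<image:{seg.image_ref}>")
    return sequence


example = build_interleaved([
    "Draw the person from",
    ImageCondition("identity", "a.png"),
    "in the style of",
    ImageCondition("style", "b.png"),
])
```

Here each reference image is preceded by the token that says which attribute to take from it, so the same image could in principle serve as a style source in one request and an identity source in another.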
