SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation

Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Kevin Blackburn-Matzen, Matheus Gadelha

TL;DR

SIGMA-Gen addresses the need for simultaneous identity and structure control in multi-subject image generation with a single diffusion-based model that takes both subject identity cues and a unified spatial control representation. It introduces SIGMA-Set27K, a synthetic dataset with multiple identities per image and per-subject annotations, which is used to train the model. The approach combines two-part spatial conditioning (routing and depth) with per-subject identity conditioning via identity crops, and reports state-of-the-art performance in identity preservation, image fidelity, and generation speed, especially in scenes with five or more subjects. The framework supports applications such as subject insertion and reposing, and a single model generalizes across control modalities from coarse (2D or 3D boxes) to fine (masks and depth).

Abstract

We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision -- from coarse 2D or 3D boxes to pixel-level segmentations and depth -- with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed. Code and visualizations at https://oindrilasaha.github.io/SIGMA-Gen/

Paper Structure

This paper contains 27 sections, 20 figures, 3 tables.

Figures (20)

  • Figure 1: SIGMA-Gen enhances controllability of text-to-image workflows by allowing users to prescribe both structure and subject identity. In the top row, RGB images are used to describe subject identities. A 3D scene can be arranged by the user to describe the image structure; in these examples, meshes were automatically created using image-to-3D. The user can then assign identities to each subject (colors representing the assignments) and generate images while precisely editing the 3D scene. In the bottom part of the figure, we show that SIGMA-Gen can also be applied to simpler modes of structure guidance: 2D and 3D bounding boxes.
  • Figure 2: Pipeline for generating SIGMA-Set27K. Our fully automatic synthetic data generation pipeline involves creating compositional prompts with an LLM, generating images from these prompts, segmenting to obtain subject crops, reposing the crops to produce identity images, and estimating depth and 3D bounding boxes. We also show an example of a training sample for the fine-control scenario, which uses precise masks and depth. The routing mask is rendered in RGB for visualization purposes only; in practice, the subject pixel values in this example are 10, 20, and 30.
  • Figure 2: Ablation over increasing guidance. (bg) denotes text prompts describing only the background, whereas (full) denotes prompts describing the whole scene. Removing depth reduces performance, while providing full prompts that include subject names improves it.
  • Figure 3: Multi-subject generation with masks and depth. SIGMA-Gen outperforms baselines in both image quality (see zoomed crops at top-right) and subject identity preservation. For our method, we prepend "Place these subjects in" to the prompts.
  • Figure 4: Multi-subject generation with coarse controls. The baseline fails to maintain position or identity, while SIGMA-Gen adheres to both 2D and 3D bounding-box coarse control. For our method, we prepend "Place these subjects to compose: " to the prompt.
  • ...and 15 more figures
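
The routing mask mentioned in the first Figure 2 caption (subject pixel values 10, 20, 30) can be illustrated with a small sketch. This is not the authors' code. The helper names `box_to_mask` and `build_routing_mask`, the box format, the background value of 0, the constant step of 10 between subjects, and the rule that later subjects overwrite earlier ones on overlap are all assumptions made for illustration. The sketch rasterizes coarse 2D boxes into binary masks and merges them into one single-channel map that assigns each pixel to a subject.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]
# Hypothetical box format: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def box_to_mask(box: Box, height: int, width: int) -> Mask:
    """Rasterize a 2D box into a binary mask (a coarse form of control)."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def build_routing_mask(subject_masks: Sequence[Mask], step: int = 10) -> Mask:
    """Merge per-subject binary masks into one single-channel ID map.

    Subject i (0-based) is written as (i + 1) * step, which reproduces the
    values 10, 20, 30 for three subjects. Background pixels stay 0 and later
    subjects overwrite earlier ones on overlap (both are assumptions).
    """
    if not subject_masks:
        raise ValueError("need at least one subject mask")
    height, width = len(subject_masks[0]), len(subject_masks[0][0])
    routing = [[0] * width for _ in range(height)]
    for index, mask in enumerate(subject_masks):
        value = (index + 1) * step
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    routing[y][x] = value
    return routing


# Three subjects placed with coarse boxes on a tiny 4x6 canvas.
boxes = [(0, 0, 2, 2), (2, 1, 4, 3), (4, 2, 6, 4)]
masks = [box_to_mask(b, height=4, width=6) for b in boxes]
routing = build_routing_mask(masks)
for row in routing:
    print(row)
```

Pixel-level segmentation masks could be passed to `build_routing_mask` in place of the rasterized boxes, which mirrors the coarse-to-fine range of controls described in the abstract. How the model actually consumes the routing mask and depth is not specified in this section.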