Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Danfeng Li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu

TL;DR

Seg2Any tackles the challenge of achieving precise spatial layout and semantic control in open-set segmentation-mask-to-image generation. It introduces a semantic-shape decoupling framework that uses a Semantic Alignment Attention Mask and an Entity Contour Map via Sparse Shape Feature Adaptation, along with an Attribute Isolation Attention Mask to prevent cross-entity leakage. The authors contribute SACap-1M, a 1M-image open-set dataset with 5.9M regional captions, plus SACap-Eval for open-set evaluation, enabling robust benchmarking. Experiments show state-of-the-art results on both open-set and closed-set S2I tasks, with strong performance in fine-grained spatial and attribute control and competitive global image quality.

Abstract

Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
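The two attention masks described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the token layout (regional text tokens followed by image tokens), the per-token entity ids, and the function name `build_masks` are assumptions made for illustration, and the real model also handles condition tokens, which are omitted here.

```python
import numpy as np

def build_masks(text_entity, image_entity):
    """Build boolean attention masks (True = attention allowed).

    text_entity[i]  : entity id of the i-th regional text token (assumed layout)
    image_entity[j] : entity id of the j-th image token (assumed layout)
    """
    t = np.asarray(text_entity)
    m = np.asarray(image_entity)
    # Semantic alignment: an image token sees only its own entity's text tokens.
    semantic = m[:, None] == t[None, :]    # shape (n_img, n_txt)
    # Attribute isolation: an image token sees only image tokens of its entity.
    isolation = m[:, None] == m[None, :]   # shape (n_img, n_img)
    # Text tokens of one regional prompt attend among themselves.
    text_self = t[:, None] == t[None, :]   # shape (n_txt, n_txt)
    # Joint mask over the concatenated sequence [text | image].
    full = np.block([[text_self, semantic.T], [semantic, isolation]])
    return semantic, isolation, full

# Two entities: entity 0 has two text tokens, entity 1 has one.
semantic, isolation, full = build_masks([0, 0, 1], [0, 0, 1, 1])
```

In this toy layout the joint mask is block-diagonal per entity, which is the property that keeps one entity's attributes from reaching another entity's tokens.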

Paper Structure

This paper contains 39 sections, 7 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: We propose Seg2Any, a novel segmentation-mask-to-image generation approach that achieves strong shape consistency and fine-grained attribute control (e.g. color, style, and text).
  • Figure 2: Comparison in terms of shape and semantic consistency. Semantic inconsistency is annotated by blue boxes, while shape inconsistency is highlighted with red boxes, which reveal inconsistency in the number of vertical bars on the railings. In contrast, our approach achieves both shape and semantic consistency.
  • Figure 3: (a) An overview of the Seg2Any framework. Segmentation masks are transformed into an Entity Contour Map, then encoded as condition tokens via a frozen VAE. Negligible tokens are filtered out for efficiency. The resulting text, image, and condition tokens are concatenated into a unified sequence for MM-Attention. Our framework applies LoRA to all branches, achieving S2I generation with minimal extra parameters. (b) Attention Masks in MM-Attention, including the Semantic Alignment Attention Mask (Section \ref{subsubsec:semantic_integration}) and the Attribute Isolation Attention Mask (Section \ref{subsection:attribute_isolation_attention_mask}).
  • Figure 4: Visualization results of different attribute isolation strategies. In Column 1, 20 colored circular badges labeled A to T are required to be generated in raster order. The results show that our Attribute Isolation Attention Mask effectively prevents attribute leakage between entities. Columns 2-4 demonstrate that direct application of the mask without training leads to visual inconsistencies, manifesting as unnatural shadows and reflections. In contrast, the training-based approach on our proposed large-scale dataset achieves both strong attribute control and high visual coherence.
  • Figure 5: Qualitative comparisons on SACap-Eval. Seg2Any accurately generates entities exhibiting complex attributes such as color and texture, surpassing previous approaches.
  • ...and 9 more figures
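
The shape branch outlined in the Figure 3 caption (segmentation mask to Entity Contour Map, then dropping negligible tokens) can be sketched roughly as follows. This is an illustrative approximation only: the neighbour-difference contour rule, the patch-based filtering that stands in for filtering VAE tokens, and the names `entity_contour_map` and `keep_patches` are all assumptions, not details taken from the paper.

```python
import numpy as np

def entity_contour_map(labels):
    """Mark pixels that touch a horizontally or vertically adjacent pixel
    with a different entity label (assumed contour rule)."""
    labels = np.asarray(labels)
    contour = np.zeros(labels.shape, dtype=bool)
    diff_h = labels[:, :-1] != labels[:, 1:]   # left/right neighbours differ
    diff_v = labels[:-1, :] != labels[1:, :]   # upper/lower neighbours differ
    contour[:, :-1] |= diff_h
    contour[:, 1:] |= diff_h
    contour[:-1, :] |= diff_v
    contour[1:, :] |= diff_v
    return contour

def keep_patches(contour, patch):
    """Return a (H/patch, W/patch) boolean grid that is True for patches
    containing at least one contour pixel; the rest would be dropped."""
    h, w = contour.shape
    grid = contour.reshape(h // patch, patch, w // patch, patch)
    return grid.any(axis=(1, 3))

# Toy 8x8 mask: entity 0 on the left half, entity 1 on the right half.
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1
contour = entity_contour_map(labels)
keep = keep_patches(contour, patch=2)
```

Here only the two patch columns that straddle the boundary are kept, so half of the condition tokens in this toy example would be discarded.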