Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Danfeng Li; Hui Zhang; Sheng Wang; Jiacheng Li; Zuxuan Wu

TL;DR

Seg2Any targets precise spatial layout and semantic control in open-set segmentation-mask-to-image (S2I) generation. It decouples the segmentation mask condition into semantics and shape: a Semantic Alignment Attention Mask binds each entity to its regional text prompt, and an Entity Contour Map, injected via Sparse Shape Feature Adaptation, conveys entity boundaries. An Attribute Isolation Attention Mask prevents attribute leakage across entities. The authors also contribute SACap-1M, an open-set dataset of 1M images with 5.9M regional captions, and SACap-Eval, a benchmark for open-set evaluation. Experiments show state-of-the-art results on both open-set and closed-set S2I tasks, with strong fine-grained spatial and attribute control and competitive global image quality.

Abstract

Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
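The two attention masks described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each image token carries a single entity label, reads "attend exclusively to themselves" as "attend only to image tokens of the same entity", and ignores global prompts and unlabeled background. The function names and the token layout are hypothetical and chosen only for illustration.

```python
def attribute_isolation_mask(entity_ids):
    """Image-to-image mask (assumed construction).

    entity_ids: flat list with one entity label per image token.
    Returns mask[i][j] == True when image token i may attend to image token j,
    which here means both tokens belong to the same entity.
    """
    n = len(entity_ids)
    return [[entity_ids[i] == entity_ids[j] for j in range(n)] for i in range(n)]


def semantic_alignment_mask(entity_ids, text_owner):
    """Image-to-text mask (assumed construction).

    text_owner: one entity label per regional text token.
    Returns mask[i][t] == True when image token i may attend to text token t,
    i.e. the text token belongs to the prompt of that token's entity.
    """
    return [[img == txt for txt in text_owner] for img in entity_ids]


# Hypothetical 2x2 latent grid flattened row by row:
# the left column is entity 0, the right column is entity 1.
entity_ids = [0, 1, 0, 1]
# Three regional text tokens: two describe entity 0, one describes entity 1.
text_owner = [0, 0, 1]

iso = attribute_isolation_mask(entity_ids)
sem = semantic_alignment_mask(entity_ids, text_owner)

for row in iso:
    print(["x" if allowed else "." for allowed in row])
```

In an attention layer, boolean masks of this kind would typically be applied by setting the disallowed logits to negative infinity before the softmax; how Seg2Any combines them inside the multimodal diffusion transformer is not specified in the abstract.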

Paper Structure

Table of Contents

Figures (14)