GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections

Shiyue Zhang, Zheng Chong, Xujie Zhang, Hanhui Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

TL;DR

This work proposes GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. It achieves higher fidelity and finer-grained semantic alignment than existing competitors.

Abstract

General text-to-image models have brought revolutionary innovation to the arts, design, and media. However, when applied to garment generation, even state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly in the quantity, position, and interrelations of garment components. To address this, we propose GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline that obtains spatial and quantitative information about garment components from the corresponding images and captions. Then, to exploit component relationships within garment images, we construct a retrieval subset for each garment through retrieval augmentation based on component-level similarity ranking, and we conduct contrastive learning on the resulting positive and negative samples to enhance the model's perception of components. To further improve component alignment at the semantic, spatial, and quantitative levels, we propose multi-level correction losses that leverage detailed component information. Experiments show that GarmentAligner achieves higher fidelity and finer-grained semantic alignment than existing competitors.

Paper Structure

This paper contains 14 sections, 7 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: GarmentAligner produces high-quality garment images that accurately depict the quantity and spatial alignment of the components specified in the provided captions.
  • Figure 2: Failures of the state-of-the-art text-to-image model Midjourney on the text-to-garment task. The misalignment mainly concerns the quantities and spatial positions of components, which makes it difficult to generate garments with the expected fine-grained details.
  • Figure 3: Illustration of how misalignment is addressed by retrieval-augmented contrastive learning. By learning from positive and negative samples retrieved via component-level similarity ranking, the model improves its perception of component relationships.
  • Figure 4: Overview of the proposed GarmentAligner. During training, retrieval samples are constructed using multi-level semantic similarity ranking for contrastive learning, with the aim of global perceptual alignment. Simultaneously, multiple correction losses refine the visual semantics, spatial positions, and quantities of garment components, improving the granularity of details.
  • Figure 5: Illustration of the proposed retrieval-augmented contrastive learning. Retrieval for each sample is performed within a randomly selected subset of $N$ samples, based on component-level semantic similarity ranking. The retrieval results are then filtered by a global assessment to obtain positive and negative samples for contrastive learning.
  • ...and 5 more figures
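
The retrieval step described in the Figure 5 caption (rank a random subset of $N$ samples by component-level similarity, then pick positive and negative samples) might be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the weighted Jaccard similarity over component counts, the function names `component_similarity` and `retrieve`, and the toy garment data are all hypothetical, and the global assessment filtering mentioned in the caption is omitted.

```python
import random
from collections import Counter


def component_similarity(a: Counter, b: Counter) -> float:
    """Weighted Jaccard similarity between two component-count multisets.

    ASSUMPTION: the source does not define its similarity measure;
    this metric is a stand-in chosen for illustration only.
    """
    keys = set(a) | set(b)
    union = sum(max(a[k], b[k]) for k in keys)
    if union == 0:
        return 1.0  # two garments with no components are treated as identical
    inter = sum(min(a[k], b[k]) for k in keys)
    return inter / union


def retrieve(anchor_id, components, subset_size, rng):
    """Rank a random subset of up to N other garments against the anchor.

    Returns (most similar, least similar) as a crude positive/negative pair.
    """
    pool = [g for g in components if g != anchor_id]
    subset = rng.sample(pool, min(subset_size, len(pool)))
    ranked = sorted(
        subset,
        key=lambda g: component_similarity(components[anchor_id], components[g]),
        reverse=True,
    )
    return ranked[0], ranked[-1]


# Hypothetical component counts, as an extraction pipeline might produce them.
components = {
    "shirt_a": Counter({"pocket": 2, "button": 5}),
    "shirt_b": Counter({"pocket": 2, "button": 4}),
    "shirt_c": Counter({"pocket": 1, "button": 3}),
    "dress_d": Counter({"zipper": 1}),
}

positive, negative = retrieve("shirt_a", components, subset_size=3, rng=random.Random(0))
print(positive, negative)  # positive is "shirt_b", negative is "dress_d"
```

Counting components, rather than only checking their presence, lets the ranking separate garments that differ in quantity alone (for example, four buttons versus five), which is the kind of fine-grained mismatch the abstract highlights.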