FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training

Jiale Huang, Dehong Gao, Jinxia Zhang, Zechao Zhan, Yang Hu, Xin Wang

TL;DR

FashionFAE tackles the need for fine-grained attribute understanding in fashion vision-language pre-training by introducing Attribute-Emphasized Text Prediction (AETP) and Attribute-Promoted Image Reconstruction (APIR). These tasks push the model to extract salient textual attributes and reconstruct image patches in an attribute-aware latent space, using a ViT image encoder and a BERT-based fusion module. The framework optimizes a joint objective over five pre-training tasks ($\mathcal{L}_{AETP}$, $\mathcal{L}_{APIR}$, $\mathcal{L}_{ITC}$, $\mathcal{L}_{MLM}$, and $\mathcal{L}_{ITM}$), with task sampling to balance learning. On FashionGen, FashionFAE achieves state-of-the-art results in cross-modal retrieval (mean improvement of $2.9\%$ on the sub-test set and $5.2\%$ on the full test set) and in category/subcategory recognition (average improvement of $1.6\%$), supporting the value of explicitly modeling fine-grained attributes in both the text and image modalities for fashion applications.
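
The task sampling mentioned above can be pictured with a short sketch. This is an illustration only, not the authors' implementation: the five task names come from the text, while the uniform sampling weights, the one-task-per-step schedule, and the helper names `sample_task` and `schedule` are assumptions introduced here.

```python
import random
from collections import Counter

# Task names are taken from the text; the weights are hypothetical.
TASKS = ["AETP", "APIR", "ITC", "MLM", "ITM"]
WEIGHTS = [1.0, 1.0, 1.0, 1.0, 1.0]  # assumption: uniform sampling


def sample_task(rng):
    """Draw one pre-training task for the current training step."""
    return rng.choices(TASKS, weights=WEIGHTS, k=1)[0]


def schedule(num_steps, seed=0):
    """Return the task chosen at each of num_steps training steps."""
    rng = random.Random(seed)
    return [sample_task(rng) for _ in range(num_steps)]


if __name__ == "__main__":
    # Count how often each task is selected over many steps.
    counts = Counter(schedule(10_000))
    for task in TASKS:
        print(task, counts[task])
```

In a real training loop, the sampled name would select which loss to compute for that step; changing the weights would shift how often each objective is optimized.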

Abstract

Large-scale Vision-Language Pre-training (VLP) has demonstrated remarkable success in the general domain. However, in the fashion domain, items are distinguished by fine-grained attributes like texture and material, which are crucial for tasks such as retrieval. Existing models often fail to leverage these fine-grained attributes from both text and image modalities. To address the above issues, we propose a novel approach for the fashion domain, Fine-grained Attributes Enhanced VLP (FashionFAE), which focuses on the detailed characteristics of fashion data. An attribute-emphasized text prediction task is proposed to predict fine-grained attributes of the items. This forces the model to focus on the salient attributes from the text modality. Additionally, a novel attribute-promoted image reconstruction task is proposed, which further enhances the fine-grained ability of the model by leveraging the representative attributes from the image modality. Extensive experiments show that FashionFAE significantly outperforms State-Of-The-Art (SOTA) methods, achieving 2.9% and 5.2% improvements in retrieval on sub-test and full test sets, respectively, and a 1.6% average improvement in recognition tasks.

Paper Structure

This paper contains 17 sections, 9 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: FashionFAE achieves SOTA performance in various metrics for cross-modal retrieval and (sub)category recognition in the fashion domain.
  • Figure 2: Overview of the FashionFAE model architecture and the proposed AETP and APIR tasks. To accommodate the five pre-training tasks, the model operates in three modes: (a) contrastive mode; (b) fusion mode; (c) image reconstruction mode.
  • Figure 3: The visual representations of black shirts, black sweaters, and black pants are hard to distinguish because they lie close together in the visual space, but they are well separated in the higher-level attribute space.