FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training

Jiale Huang; Dehong Gao; Jinxia Zhang; Zechao Zhan; Yang Hu; Xin Wang

TL;DR

FashionFAE tackles the need for fine-grained attribute understanding in fashion vision-language pre-training by introducing Attribute-Emphasized Text Prediction (AETP) and Attribute-Promoted Image Reconstruction (APIR). These tasks push the model to extract salient textual attributes and reconstruct image patches in an attribute-aware latent space, using a ViT image encoder and a BERT-based fusion module. The framework optimizes a joint objective over five pre-training tasks ($L_{AETP}$, $L_{APIR}$, $L_{ITC}$, $L_{MLM}$, and $L_{ITM}$), with task sampling to balance learning. On FashionGen, FashionFAE achieves state-of-the-art results in cross-modal retrieval (sub-test mean improvement $2.9\%$, full test $5.2\%$) and category/subcategory recognition (average ~$2.6\%$), validating the value of explicitly modeling fine-grained attributes in both text and image modalities for fashion applications.
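The combination of a joint objective with task sampling can be illustrated with a small sketch. The task names come from the summary above; the uniform sampling weights, the unit loss coefficients, and the helper names `sample_task` and `joint_loss` are assumptions made for illustration, not details confirmed by the paper.

```python
import random
from collections import Counter

# Task names taken from the summary; everything else here is hypothetical.
TASKS = ["AETP", "APIR", "ITC", "MLM", "ITM"]


def sample_task(rng, weights=None):
    """Pick one pre-training task for the current training step."""
    if weights is None:
        weights = [1.0] * len(TASKS)  # assumption: uniform task sampling
    return rng.choices(TASKS, weights=weights, k=1)[0]


def joint_loss(losses, coeffs=None):
    """Weighted sum of per-task losses (unit coefficients are an assumption)."""
    if coeffs is None:
        coeffs = {name: 1.0 for name in losses}
    return sum(coeffs[name] * value for name, value in losses.items())


# Simulate which task would be trained at each of 1000 steps.
rng = random.Random(0)
schedule = [sample_task(rng) for _ in range(1000)]
counts = Counter(schedule)
```

Sampling one task per step is one plausible way to keep any single objective from dominating training; the actual sampling scheme and loss weighting used by FashionFAE may differ.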

Abstract

Large-scale Vision-Language Pre-training (VLP) has demonstrated remarkable success in the general domain. However, in the fashion domain, items are distinguished by fine-grained attributes like texture and material, which are crucial for tasks such as retrieval. Existing models often fail to leverage these fine-grained attributes from both text and image modalities. To address the above issues, we propose a novel approach for the fashion domain, Fine-grained Attributes Enhanced VLP (FashionFAE), which focuses on the detailed characteristics of fashion data. An attribute-emphasized text prediction task is proposed to predict fine-grained attributes of the items. This forces the model to focus on the salient attributes from the text modality. Additionally, a novel attribute-promoted image reconstruction task is proposed, which further enhances the fine-grained ability of the model by leveraging the representative attributes from the image modality. Extensive experiments show that FashionFAE significantly outperforms State-Of-The-Art (SOTA) methods, achieving 2.9% and 5.2% improvements in retrieval on sub-test and full test sets, respectively, and a 1.6% average improvement in recognition tasks.

Paper Structure (17 sections, 9 equations, 3 figures, 5 tables)

Introduction
Method
Model Overview
Pre-training Tasks
Attribute-Emphasized Text Prediction (AETP)
Attribute-Promoted Image Reconstruction (APIR)
Image-Text Contrastive Learning (ITC)
Masked Language Modeling (MLM)
Image-Text Matching (ITM)
Experiments
Dataset
Implementation Details
Downstream Tasks and Results
Cross-modal Retrieval
Category/Subcategory Recognition (CR&SCR)
...and 2 more sections

Figures (3)

Figure 1: FashionFAE achieves SOTA performance in various metrics for cross-modal retrieval and (sub)category recognition in the fashion domain.
Figure 2: Overview of our FashionFAE model architecture and the proposed AETP and APIR tasks. To accommodate the 5 different pre-training tasks, the model has a total of 3 modes: (a) Contrastive mode; (b) Fusion mode; (c) Image reconstruction mode.
Figure 3: The visual representations of black shirts, black sweaters, and black pants are hard to distinguish because they appear similar in the visual space, but they can be well differentiated in the more advanced attribute space.
