Attribute-Guided Multi-Level Attention Network for Fine-Grained Fashion Retrieval

Ling Xiao, Toshihiko Yamasaki

TL;DR

This work tackles fine-grained fashion retrieval by addressing the feature gap that arises when pre-trained backbones are fine-tuned for attribute-specific tasks. The authors propose AG-MAN, which combines hierarchical multi-level feature extraction with an attribute-guided attention (AGA) module and adds a classification loss that perturbs object-centric learning and diversifies feature representations. The model achieves state-of-the-art performance on attribute-specific fashion retrieval (ASFR) and triplet relation (TR) prediction across FashionAI, DeepFashion, and Zappos50k, with ablations confirming the contribution of each component. The approach offers strong retrieval accuracy and interpretable attribute localization, with potential for scalable attribute-guided recognition in fashion e-commerce and copyright protection contexts.

Abstract

Fine-grained fashion retrieval searches for items that share a similar attribute with the query image. Most existing methods use a pre-trained feature extractor (e.g., ResNet50) to capture image representations. However, such a backbone is typically trained for image classification or object detection, which are fundamentally different tasks from fine-grained fashion retrieval. Existing methods therefore suffer from a feature gap when the pre-trained backbone is fine-tuned directly. To solve this problem, we introduce an attribute-guided multi-level attention network (AG-MAN). Specifically, we first enhance the pre-trained feature extractor to capture multi-level image embeddings, thereby enriching the low-level features within these representations. Then, we propose a classification scheme in which images with the same attribute, albeit with different values, are categorized into the same class. This further alleviates the feature gap by perturbing object-centric feature learning. Moreover, we propose an improved attribute-guided attention module for extracting more accurate attribute-specific representations. Our model consistently outperforms existing attention-based methods on the FashionAI (62.8788% MAP), DeepFashion (8.9804% MAP), and Zappos50k (93.32% prediction accuracy) datasets. In particular, it improves on the representative ASENet_V2 model by 2.12, 0.31, and 0.78 percentage points on the FashionAI, DeepFashion, and Zappos50k datasets, respectively. The source code is available at https://github.com/Dr-LingXiao/AG-MAN.
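
The classification scheme in the abstract can be illustrated with a minimal Python sketch. It assumes each training image carries an (attribute, value) annotation; the helper name `attribute_class_labels` and the example attribute names and values are hypothetical and are not taken from the authors' code. The point is only that the classification target depends on the attribute alone, so images with different values of the same attribute receive the same class.

```python
from typing import Dict, List, Tuple


def attribute_class_labels(
    annotations: List[Tuple[str, str]],
) -> Tuple[List[int], Dict[str, int]]:
    """Map each (attribute, value) pair to a class index that ignores the value."""
    attr_to_class: Dict[str, int] = {}
    labels: List[int] = []
    for attribute, _value in annotations:
        # A new class index is created the first time an attribute is seen.
        if attribute not in attr_to_class:
            attr_to_class[attribute] = len(attr_to_class)
        labels.append(attr_to_class[attribute])
    return labels, attr_to_class


# Hypothetical annotations, for illustration only.
samples = [
    ("sleeve_length", "short"),
    ("sleeve_length", "long"),
    ("collar_design", "shirt"),
    ("sleeve_length", "sleeveless"),
]
labels, mapping = attribute_class_labels(samples)
print(labels)   # [0, 0, 1, 0]
print(mapping)  # {'sleeve_length': 0, 'collar_design': 1}
```

All three `sleeve_length` images land in class 0 despite having different values, which is the grouping the abstract describes. How the authors attach this target to the network and weight the resulting loss is not specified here.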

Paper Structure (18 sections, 9 equations, 9 figures, 10 tables)


Figures (9)

  • Figure 1: Model architecture. #1, #2, #3, and #4 are the four blocks of the ResNet50 backbone.
  • Figure 2: The proposed classification branch perturbs the object-centric feature learning in the fine-tuning process by grouping images with the same attribute but different sub-classes into the same class. The number above the image denotes the sub-class under each attribute.
  • Figure 3: Details of (a) ASA, (b) SA, (c) ACA, and (d) CA in the proposed AGA module. ASA and SA improve localization of the relevant region, while ACA and CA improve the ability to distinguish between different attributes within the same region.
  • Figure 4: The conceptual structure of the compared models and ours. CSN is a conditional similarity network, while ASENet_V2 and AttnFashion are attention networks from this research field. Ours is also an attention network.
  • Figure 5: Top-10 retrieval examples from the FashionAI dataset. Images with a black bounding box share the query image's sub-class, while images with a red bounding box belong to a different sub-class. Most of the top-10 results returned by our model are correct.
  • ...and 4 more figures
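
The correct and incorrect marks in Figure 5 are the kind of relevance judgments that a mean average precision (MAP) score summarizes. The sketch below shows one common definition of average precision over a ranked list. It is an assumption that this matches the paper's evaluation protocol (for example, some protocols normalize by the number of relevant items in the whole gallery), and the function names are hypothetical.

```python
from typing import List


def average_precision(relevant: List[bool]) -> float:
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    # A ranking with no relevant item scores 0 by convention in this sketch.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[List[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# One query whose 1st, 3rd, and 4th results share the query's sub-class.
ap = average_precision([True, False, True, True])
print(round(ap, 4))  # 0.8056
```

For the example ranking, the precisions at the relevant ranks are 1/1, 2/3, and 3/4, so the average precision is their mean, about 0.8056.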