FashionFail: Addressing Failure Cases in Fashion Object Detection and Segmentation

Riza Velioglu, Robin Chan, Barbara Hammer

TL;DR

FashionFail is a new dataset of e-commerce images for fashion object detection and segmentation. It reveals the shortcomings of leading models such as Attribute-Mask R-CNN and Fashionformer.

Abstract

In the realm of fashion object detection and segmentation for online shopping images, existing state-of-the-art fashion parsing models encounter limitations, particularly when exposed to non-model-worn apparel and close-up shots. To address these failures, we introduce FashionFail, a new fashion dataset with e-commerce images for object detection and segmentation. The dataset is efficiently curated using our novel annotation tool that leverages recent foundation models. The primary objective of FashionFail is to serve as a test bed for evaluating the robustness of models. Our analysis reveals the shortcomings of leading models, such as Attribute-Mask R-CNN and Fashionformer. Additionally, we propose a baseline approach using naive data augmentation to mitigate common failure cases and improve model robustness. Through this work, we aim to inspire and support further research in fashion item detection and segmentation for industrial applications. The dataset, annotation tool, code, and models are available at https://rizavelioglu.github.io/fashionfail/.

Paper Structure (13 sections, 5 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Image examples from FashionFail-test and prediction failure cases of two state-of-the-art fashion detection models: Attribute-Mask R-CNN (SpineNet-143) [jia2020fashionpedia] and Fashionformer (Swin-base) [xu2022fashionformer].
  • Figure 2: The effects of scale and context on bounding box predictions of Attribute-Mask R-CNN (SpineNet-143). The original input images are shown leftmost. Oversized items and missing context lead to incorrect or missing detections.
  • Figure 3: Screenshots of our filtering tool for three images. Annotators can label images with the click of a button (bottom left) or use a keyboard shortcut for faster labeling. The top-left displays statistics such as speed (images per second) and expected time left, while the top-right shows the total number of images to label.
  • Figure 4: The annotation pipeline employed in curating FashionFail. Left: GPT-3.5 [brown2020language] is prompted with the product description to predict an apparel label. Middle: Grounding DINO [liu2023grounding], when provided with the product image and a generic text prompt like "an object", accurately derives bounding box coordinates for all categories. Right: SAM [kirillov2023segment], in conjunction with the box coordinates and the product image, produces precise segmentation masks.
  • Figure 5: A screenshot of our simple tool used during the data quality review. Left: original image and the label generated by the LLM. Middle: the binary segmentation mask (white pixels) obtained from SAM. Right: visualization of the bounding box generated by Grounding DINO.
  • ...and 4 more figures
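
The three-stage annotation pipeline described in the Figure 4 caption (description to label, image to box, box to mask) can be sketched roughly as follows. This is a minimal illustration of the data flow only, not the authors' tool: the names `Annotation`, `annotate`, `label_fn`, `box_fn`, and `mask_fn` are hypothetical, and the three `stub_*` functions are trivial stand-ins for GPT-3.5, Grounding DINO, and SAM respectively. No real model API is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # placeholder image type: rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Mask = List[List[int]]           # binary mask, 1 marks an item pixel


@dataclass
class Annotation:
    label: str
    box: Box
    mask: Mask


def annotate(
    description: str,
    image: Image,
    label_fn: Callable[[str], str],        # stand-in for the LLM stage
    box_fn: Callable[[Image, str], Box],   # stand-in for the detector stage
    mask_fn: Callable[[Image, Box], Mask], # stand-in for the segmenter stage
    box_prompt: str = "an object",         # generic prompt quoted in the caption
) -> Annotation:
    """Chain the three stages: text -> label, image -> box, box -> mask."""
    label = label_fn(description)
    box = box_fn(image, box_prompt)
    mask = mask_fn(image, box)
    return Annotation(label=label, box=box, mask=mask)


# --- Hypothetical stubs, used only to make the sketch runnable ---

def stub_label(description: str) -> str:
    # A real pipeline would prompt an LLM with the product description.
    return "bag" if "bag" in description.lower() else "unknown"


def stub_box(image: Image, prompt: str) -> Box:
    # A real pipeline would run an open-vocabulary detector; here the
    # box simply covers the whole image.
    return (0, 0, len(image[0]), len(image))


def stub_mask(image: Image, box: Box) -> Mask:
    # A real pipeline would run a promptable segmenter; here every pixel
    # inside the box is marked as foreground.
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(len(image[0]))]
        for y in range(len(image))
    ]


if __name__ == "__main__":
    image = [[0] * 4 for _ in range(3)]  # dummy 4x3 image
    ann = annotate("Leather bag with strap", image, stub_label, stub_box, stub_mask)
    print(ann.label, ann.box)
```

Passing the stages in as callables keeps the sketch independent of any particular model library, so each stub could be swapped for a real model wrapper without changing `annotate`.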