Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review

Sonia Bbouzidi, Ghazala Hcini, Imen Jdey, Fadoua Drira

TL;DR

The paper surveys CNNs, Vision Transformers (ViTs), and hybrid CNN–ViT architectures for Fashion MNIST in the context of fashion e-commerce, comparing local feature extraction with global context modeling. It details representative CNN and ViT approaches, key hybridization strategies (parallel, sequential, hierarchical), and dataset and methodology considerations, including common metrics. The review notes that CNNs achieve very high accuracies (e.g., $99.1\%$ with cnn-dropout-3), that ViTs reach competitive performance (up to $95.25\%$ on Fashion MNIST), and that hybrid models such as TSD, CAReNet, MixMobileNet, and HSViT can surpass individual architectures (e.g., up to $96.56\%$). It emphasizes the practical impact for fashion AI in e-commerce, including improved clothing classification, personalized recommendations, and robust visual search, and outlines open challenges for future work such as data efficiency, interpretability, and computational demands.

Abstract

Our review compares Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for image classification, with a particular focus on clothing classification within the e-commerce sector. Using the Fashion MNIST dataset, we examine the distinguishing attributes of CNNs and ViTs. While CNNs have long been the cornerstone of image classification, ViTs introduce a self-attention mechanism that enables nuanced weighting of different components of the input. Historically, transformers have primarily been associated with Natural Language Processing (NLP) tasks. Through a comprehensive examination of existing literature, we aim to identify the distinctions between ViTs and CNNs in the context of image classification. Our analysis scrutinizes state-of-the-art methodologies employing both architectures in order to identify the factors influencing their performance. These factors include dataset characteristics, image dimensions, the number of target classes, hardware infrastructure, and the specific architectures along with their respective top results. Our key goal is to determine whether ViT or CNN is the more appropriate architecture for classifying images in the Fashion MNIST dataset within the e-commerce industry, taking into account specific conditions and needs. We highlight the importance of combining these two architectures in different forms to enhance overall performance. By uniting these architectures, we can take advantage of their complementary strengths, which may lead to more precise and reliable models for e-commerce applications. CNNs are skilled at recognizing local patterns, while ViTs are effective at grasping overall context, making their combination a promising strategy for boosting image classification performance.
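
To make the self-attention idea in the abstract concrete, the following is a minimal sketch rather than any model from the review. It splits a 28×28 grayscale image (the Fashion MNIST image size) into patches and computes scaled dot-product attention weights between them. The patch size of 7, the random stand-in image, the function names, and the use of identity projections (queries, keys, and values all equal to the raw patches) are illustrative assumptions; a real ViT adds learned projections, position embeddings, and multiple heads.

```python
import math
import random


def split_into_patches(image, patch_size):
    """Split a square 2D image (list of rows) into flattened, non-overlapping patches."""
    side = len(image)
    patches = []
    for top in range(0, side, patch_size):
        for left in range(0, side, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (assumption: Q = K = V = tokens)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights, outputs = [], []
    for query in tokens:
        # Similarity of this patch to every patch, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted mix of all patches, i.e. global context.
        outputs.append([sum(w * value[i] for w, value in zip(row, tokens))
                        for i in range(dim)])
    return weights, outputs


# Random stand-in for one 28x28 Fashion MNIST image (hypothetical data, not the dataset).
random.seed(0)
image = [[random.random() for _ in range(28)] for _ in range(28)]
patches = split_into_patches(image, 7)      # 16 patches of 49 pixels each
weights, outputs = self_attention(patches)  # 16x16 attention weights
```

Each row of `weights` describes how strongly one patch attends to every other patch in the image. A convolution, by contrast, only mixes pixels inside a small fixed window, which is the local-versus-global distinction the abstract draws.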

Paper Structure (23 sections, 11 figures, 12 tables)


Figures (11)

  • Figure 1: The evolution of online sales turnover since 2014 and predictions for 2024.
  • Figure 2: Estimated annual spending in each consumer goods e-commerce category in 2022.
  • Figure 3: Different architectures of ViT and CNN.
  • Figure 4: Number of papers using Vision Transformer and different CNN architectures in image classification tasks between 2018 and 2024.
  • Figure 5: The standard CNN architecture.
  • ...and 6 more figures