A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis

Leonardo Scabini, Andre Sacilotti, Kallil M. Zielinski, Lucas C. Ribas, Bernard De Baets, Odemir M. Bruno

TL;DR

The paper evaluates 21 pre-trained Vision Transformer (ViT) variants as fixed feature extractors for texture recognition and compares them with hand-engineered methods and CNN baselines. Features are taken from the ViT class embedding without fine-tuning, and linear classifiers are trained on them. Benchmarks on datasets including Outex, DTD, FMD, and KTH-TIPS2-b assess robustness to changes in rotation, scale, and illumination, as well as performance on in-the-wild textures. The key finding is that ViTs generally outperform both baselines, especially with stronger pre-training such as IN-21k or self-supervised methods like DINO and BeiTv2; the patch embedding and self-supervised training appear critical for texture discrimination. Efficiency remains a trade-off: some mobile variants are more practical for real-world use, while ViT-B and BeiT(v2) can extract features faster on GPUs than ResNet50 despite their higher GFLOPs and parameter counts. The paper also analyzes attention maps to interpret what the models focus on, presents ViTs as a potential paradigm shift in texture feature extraction, and calls for optimized architectures and aggregation techniques tailored to texture tasks.

Abstract

Texture, a significant visual attribute in images, has been extensively investigated across various image recognition applications. Convolutional Neural Networks (CNNs), which have been successful in many computer vision tasks, are currently among the best texture analysis approaches. On the other hand, Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition, causing a paradigm shift in the field. However, ViTs have so far not been scrutinized for texture recognition, hindering a proper appreciation of their potential in this specific setting. For this reason, this work explores various pre-trained ViT architectures when transferred to tasks that rely on textures. We review 21 different ViT variants and perform an extensive evaluation and comparison with CNNs and hand-engineered models on several tasks, such as assessing robustness to changes in texture rotation, scale, and illumination, and distinguishing color textures, material textures, and texture attributes. The goal is to understand the potential and differences among these models when directly applied to texture recognition, using pre-trained ViTs primarily for feature extraction and employing linear classifiers for evaluation. We also evaluate their efficiency, which is one of the main drawbacks in contrast to other methods. Our results show that ViTs generally outperform both CNNs and hand-engineered models, especially when using stronger pre-training and on tasks involving in-the-wild textures (images from the internet). We highlight the following promising models: ViT-B with DINO pre-training, BeiTv2, and the Swin architecture, as well as the EfficientFormer as a low-cost alternative. In terms of efficiency, despite having more GFLOPs and parameters, ViT-B and BeiT(v2) can achieve a lower feature extraction time on GPUs compared to ResNet50.
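
The evaluation protocol described in the abstract (a frozen, pre-trained extractor that is never fine-tuned, followed by a simple classifier trained only on the extracted features) can be sketched in a few lines. This is a minimal, hypothetical illustration in pure Python, not the authors' code: `toy_extractor` is a stand-in for a pre-trained ViT's class embedding, the 1-nearest-neighbour rule stands in for the classifiers used in the paper, and the tiny images and labels are made up.

```python
import math


def toy_extractor(image):
    """Stand-in for a frozen pre-trained backbone (hypothetical).

    Maps a 2-D list of grey levels to a fixed-length feature vector
    (mean, standard deviation). A real pipeline would return the ViT
    class embedding here instead.
    """
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return (mean, math.sqrt(var))


def fit_1nn(features, labels):
    """'Training' a 1-NN classifier just stores the labelled features."""
    return list(zip(features, labels))


def predict_1nn(model, feature):
    """Return the label of the stored feature closest in Euclidean distance."""
    _, label = min(model, key=lambda fl: math.dist(fl[0], feature))
    return label


# Toy data (made up): a flat patch vs. a high-contrast checkerboard.
smooth = [[10, 10], [10, 10]]
checker = [[0, 20], [20, 0]]
train_images = [smooth, checker]
train_labels = ["smooth", "checker"]

# 1) Extract features once with the frozen extractor (no fine-tuning).
train_features = [toy_extractor(img) for img in train_images]
# 2) Fit the classifier on the extracted features only.
model = fit_1nn(train_features, train_labels)
# 3) Classify an unseen sample through the same frozen extractor.
test_image = [[1, 19], [19, 1]]
prediction = predict_1nn(model, toy_extractor(test_image))
```

The point of the design is that the extractor's parameters are never updated: only the lightweight classifier sees the labels, so differences in accuracy can be attributed to the quality of the pre-trained features.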

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The usual pipeline in texture analysis. Texture samples must be encoded in meaningful image representations using feature extraction techniques, which could be either hand-engineered (usually designed specifically for textures), or based on learning models (e.g., from pre-trained deep neural networks). These representations can be used for pattern recognition tasks in a variety of applications.
  • Figure 2: The general elements of a Vision Transformer (a). One of its most important modules is the image embedding (a.k.a. tokenizer), which is responsible for preparing the pixels in a way that Transformer Encoders (b) can learn and extract meaningful visual patterns.
  • Figure 3: Texture samples from the eight image datasets used in this work. For each dataset, each column represents a different texture class, while each row represents a different sample from that class.
  • Figure 4: Efficiency analysis of ViT variants compared to hand-engineered and CNN baselines, where accuracy represents the average accuracy over the corresponding datasets and classifiers considered (KNN, LDA, and SVM). The yellow line with the smaller dots represents ResNet50 with IN-21k pre-training.
  • Figure 5: Visualization of attention maps (at the last layer) of different ViT models (d-e) for texture samples (a-c) from the FMD dataset.
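
The image embedding step named in the Figure 2 caption can be illustrated with the patch tokenization of the original ViT: the image is cut into non-overlapping square patches and each patch is flattened into a vector, which a learned linear projection (omitted here) then maps to a token. The sketch below is a pure-Python illustration under that assumption; the function name `patchify` and the 4×4 input are hypothetical, and the variants surveyed in the paper may tokenize differently.

```python
def patchify(image, patch_size):
    """Split a 2-D list (H x W) into non-overlapping, flattened patches.

    Patches are returned in row-major order; H and W must be divisible
    by patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block, row by row.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" (made up) split into four 2x2 patches.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
tokens = patchify(image, patch_size=2)
```

With the same scheme, a 224×224 input and 16×16 patches would yield (224/16)² = 196 patch vectors per image.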