VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings

Leonardo Scabini, Kallil M. Zielinski, Emir Konuk, Ricardo T. Fares, Lucas C. Ribas, Kevin Smith, Odemir M. Bruno

TL;DR

VORTEX addresses texture recognition by retooling pre-trained ViTs as frozen feature extractors and applying texture-focused feature engineering. It aggregates multi-depth token embeddings from ViT backbones into $\chi \in \mathbb{R}^{ln \times d}$ and then derives a compact, orderless descriptor $\varphi_m$ via $m$ randomized autoencoders, delivering a robust texture representation to a linear classifier. Across nine texture datasets, VORTEX achieves state-of-the-art or competitive results, often surpassing CNN-based methods and vanilla ViT features, while preserving efficiency by avoiding backbone fine-tuning. This work demonstrates the viability of transformer foundation models for texture analysis and provides a scalable, plug-in approach for texture-centric applications with strong practical impact.

Abstract

Texture recognition has recently been dominated by ImageNet-pre-trained deep Convolutional Neural Networks (CNNs), with specialized modifications and feature engineering required to achieve state-of-the-art (SOTA) performance. However, although Vision Transformers (ViTs) were introduced a few years ago, little is known about their texture recognition ability. Therefore, in this work, we introduce VORTEX (ViTs with Orderless and Randomized Token Encodings for Texture Recognition), a novel method that enables the effective use of ViTs for texture analysis. VORTEX extracts multi-depth token embeddings from pre-trained ViT backbones and employs a lightweight module to aggregate hierarchical features and perform orderless encoding, obtaining a better image representation for texture recognition tasks. This approach integrates with any ViT that follows the common transformer architecture. Moreover, the backbones are not fine-tuned: they are used only as frozen feature extractors, and the resulting features are fed to a linear SVM. We evaluate VORTEX on nine diverse texture datasets, demonstrating its ability to achieve or surpass SOTA performance in a variety of texture analysis scenarios. By bridging the gap between texture recognition with CNNs and transformer-based architectures, VORTEX paves the way for adopting emerging transformer foundation models. Furthermore, VORTEX demonstrates robust computational efficiency when coupled with ViT backbones compared to CNNs with similar costs. The method implementation and experimental scripts are publicly available in our online repository.
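
The encoding step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that token embeddings from several depths are stacked row-wise into one matrix, that each randomized autoencoder has a fixed random encoder and a decoder solved in closed form by ridge regression, and that the "soup" is a plain average of the decoder weights. The function names, the `tanh` activation, the Gaussian encoder initialization, and the `hidden_dim` and `ridge` parameters are all hypothetical choices made for illustration.

```python
import numpy as np


def rae_decoder_weights(tokens, hidden_dim, seed, ridge=1e-3):
    """Fit one randomized autoencoder and return its decoder weights.

    tokens: (N, d) matrix of token embeddings.
    Returns an array of shape (hidden_dim, d).
    """
    rng = np.random.default_rng(seed)
    d = tokens.shape[1]
    # Fixed random encoder that is never trained (assumed Gaussian init).
    w_enc = rng.standard_normal((d, hidden_dim)) / np.sqrt(d)
    hidden = np.tanh(tokens @ w_enc)  # (N, hidden_dim)
    # Closed-form ridge solution of: min_B ||hidden @ B - tokens||^2 + ridge * ||B||^2
    gram = hidden.T @ hidden + ridge * np.eye(hidden_dim)
    return np.linalg.solve(gram, hidden.T @ tokens)


def vortex_like_descriptor(layer_tokens, m=4, hidden_dim=2, ridge=1e-3):
    """Build an orderless descriptor from per-layer token embeddings.

    layer_tokens: list of l arrays, each of shape (n, d).
    Returns a flat vector of length hidden_dim * d.
    """
    # Stack the l layers row-wise into one (l*n, d) matrix.
    chi = np.concatenate(layer_tokens, axis=0)
    # Average ("soup") the decoder weights of m autoencoders, one per seed.
    soup = sum(rae_decoder_weights(chi, hidden_dim, seed, ridge) for seed in range(m))
    return (soup / m).ravel()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for frozen ViT outputs: 3 layers, 10 tokens, 6 dimensions.
    fake_layers = [rng.standard_normal((10, 6)) for _ in range(3)]
    print(vortex_like_descriptor(fake_layers).shape)
```

Because the closed-form solution depends on the tokens only through sums over rows (`hidden.T @ hidden` and `hidden.T @ tokens`), shuffling the token order leaves the descriptor unchanged up to floating-point rounding, which is what "orderless" means in this sketch. In a full pipeline the resulting vectors would be passed to a linear classifier, such as the linear SVM the abstract mentions.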

Paper Structure

This paper contains 18 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the proposed method for feature engineering with pre-trained ViTs, VORTEX (a), used in this example with the vanilla ViT architecture (b) to produce a new image representation $\varphi_m$. The structure of the RAE is shown in (c), which is an analytically-solved 1-layer auto-encoder, where we use its decoder weights as the representation (souped for $m$ RAEs).
  • Figure 2: Attention scores computed by averaging the attention heads and self-similarity scores among spatial tokens at different layers of ViT-B/16 (IN-21k pre-training; Dosovitskiy et al., 2020). Warmer colors represent higher attention.
  • Figure 3: Some mixed samples from training and test sets from each public texture recognition benchmark used in this work.
  • Figure 4: Impacts of increasing the parameter $m$ (encoder soup size) of VORTEX compared to other feature extraction approaches (CLS token and GAP) when using the ViT-B/16 backbone (IN-21k). The result for each texture dataset (a, b) is the average classification accuracy among KNN, LDA, and SVM.
  • Figure 5: Performance of the SVM classifier using VORTEX features with different ViT backbones by varying size and pre-training on two datasets (a-b).
  • ...and 3 more figures