Deep Neural Networks Fused with Textures for Image Classification

Asish Bera, Debotosh Bhattacharjee, Mita Nasipuri

TL;DR

This work addresses fine-grained image classification (FGIC) by fusing global texture information with local patch-based deep features. It introduces DNT, a two-stream model: patch features from a base CNN are encoded by an LSTM, multi-scale LBP histograms supply image-level texture, and the two streams are fused for classification. Experiments on eight FGIC datasets with four backbone CNNs show accuracy gains and support the contribution of patch encoding, texture descriptors, and random region erasing augmentation. The results suggest that deep representations and texture patterns provide complementary cues for fine-grained visual recognition.

Abstract

Fine-grained image classification (FGIC) is a challenging task in computer vision because visual differences between subcategories are small while intra-class variations are large. Deep learning methods have achieved remarkable success in solving FGIC. In this paper, we propose a fusion approach that addresses FGIC by combining global texture with local patch-based information. The first pipeline extracts deep features from fixed-size non-overlapping patches and encodes them by sequential modelling using long short-term memory (LSTM). The second pipeline computes image-level textures at multiple scales using local binary patterns (LBP). The two streams are integrated into a single feature vector for image classification. The method is tested on eight datasets representing human faces, skin lesions, food dishes, marine life, etc., using four standard backbone CNNs. Our method attains better classification accuracy than existing methods by notable margins.
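The two streams described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: patches are cut from a backbone feature map and average-pooled into a sequence (the LSTM that would consume this sequence is omitted), LBP uses the basic 256-bin code with P = 8 neighbours at radii 1 and 2, and the streams are fused by simple concatenation. All function names are hypothetical.

```python
import numpy as np


def patch_sequence(feature_map, grid=4):
    """Split an (H, W, C) feature map into grid x grid non-overlapping
    patches and average-pool each one, giving a (grid*grid, C) sequence.
    Assumption: this sequence is what a sequential encoder would read."""
    H, W, C = feature_map.shape
    if H % grid or W % grid:
        raise ValueError("feature map size must be divisible by grid")
    ph, pw = H // grid, W // grid
    patches = feature_map.reshape(grid, ph, grid, pw, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid * grid, ph, pw, C).mean(axis=(1, 2))


def lbp_codes(gray, P=8, R=1.0):
    """Basic LBP codes: P neighbours on a circle of radius R, sampled with
    bilinear interpolation and thresholded against the centre pixel."""
    gray = np.asarray(gray, dtype=np.float64)
    H, W = gray.shape
    pad = int(np.ceil(R)) + 1  # one extra pixel for interpolation
    padded = np.pad(gray, pad, mode="edge")
    ys, xs = np.mgrid[0:H, 0:W]
    ys, xs = ys + pad, xs + pad
    center = padded[ys, xs]
    codes = np.zeros((H, W), dtype=np.int64)
    for p in range(P):
        angle = 2.0 * np.pi * p / P
        dy = np.round(-R * np.sin(angle), 10)  # rounding removes float noise
        dx = np.round(R * np.cos(angle), 10)
        fy, fx = ys + dy, xs + dx
        y0, x0 = np.floor(fy).astype(int), np.floor(fx).astype(int)
        wy, wx = fy - y0, fx - x0
        value = ((1 - wy) * (1 - wx) * padded[y0, x0]
                 + (1 - wy) * wx * padded[y0, x0 + 1]
                 + wy * (1 - wx) * padded[y0 + 1, x0]
                 + wy * wx * padded[y0 + 1, x0 + 1])
        # Small tolerance so interpolation error does not flip ties.
        codes |= (value >= center - 1e-9).astype(np.int64) << p
    return codes


def multiscale_lbp_histogram(gray, scales=((8, 1.0), (8, 2.0))):
    """Concatenate one normalised LBP histogram per (P, R) scale."""
    hists = []
    for P, R in scales:
        codes = lbp_codes(gray, P, R)
        hist = np.bincount(codes.ravel(), minlength=2 ** P).astype(np.float64)
        hists.append(hist / hist.sum())
    return np.concatenate(hists)


def fuse(deep_feature, texture_feature):
    """Assumed fusion: plain concatenation of the two feature vectors."""
    return np.concatenate([deep_feature, texture_feature])


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 32))      # stand-in for CNN output
gray_image = rng.integers(0, 256, size=(32, 32))

sequence = patch_sequence(feature_map, grid=4)  # shape (16, 32)
texture = multiscale_lbp_histogram(gray_image)  # shape (512,)
fused = fuse(sequence.mean(axis=0), texture)    # shape (544,)
```

In this sketch the patch sequence is collapsed with a mean only so that the fusion step has a fixed-length vector to work with; in the described method that role belongs to the LSTM encoder.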

Paper Structure (13 sections, 3 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Proposed method (DNT) fuses deep features and texture descriptors using local binary patterns (LBP) for fine-grained image classification.
  • Figure 2: Top-row: LBP of various neighborhoods (P, R): (8,1), (8,2), (16,1), and (16,2). Bottom-row: Random erasing data augmentation on flower and celebrity-face images.
  • Figure 3: Dataset samples are shown column-wise: human faces of FG-Net and celebrity, hand shape, and ISIC skin lesions.
  • Figure 4: Dataset samples are shown column-wise: food-dishes of India and Thailand, natural objects representing flower and marine-lives.
  • Figure 5: Confusion matrix on ISIC skin cancer dataset using DNT with $4\times4$ patches and $2\times1024$ LBP based on ResNet-50 (left) and DenseNet-201 (right) backbone CNNs.
  • ...and 1 more figure
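The random erasing augmentation mentioned in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the authors' code: the erased-area range, the aspect-ratio range, the retry count, and the choice to fill the region with random pixel values are assumptions taken from the commonly used formulation of random erasing, and `random_erase` is a hypothetical name.

```python
import numpy as np


def random_erase(image, rng, area_range=(0.02, 0.2),
                 aspect_range=(0.3, 3.3), max_tries=10):
    """Return a copy of `image` with one random rectangle overwritten by
    random pixel values. The input array is left unchanged."""
    out = np.array(image, copy=True)
    H, W = out.shape[:2]
    for _ in range(max_tries):
        # Sample the target area and a log-uniform aspect ratio (h / w).
        area = rng.uniform(*area_range) * H * W
        aspect = np.exp(rng.uniform(np.log(aspect_range[0]),
                                    np.log(aspect_range[1])))
        h = int(round(np.sqrt(area * aspect)))
        w = int(round(np.sqrt(area / aspect)))
        if 0 < h <= H and 0 < w <= W:
            top = int(rng.integers(0, H - h + 1))
            left = int(rng.integers(0, W - w + 1))
            noise = rng.integers(0, 256, size=(h, w) + out.shape[2:],
                                 dtype=np.uint8)
            out[top:top + h, left:left + w] = noise
            return out
    return out  # no rectangle fitted; return the unmodified copy


rng = np.random.default_rng(42)
image = np.zeros((64, 64, 3), dtype=np.uint8)   # stand-in for a real photo
augmented = random_erase(image, rng)
```

Applying the function only to training images, and with some probability rather than always, is the usual practice; that probability is left to the caller here.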