Deep Neural Networks Fused with Textures for Image Classification
Asish Bera, Debotosh Bhattacharjee, Mita Nasipuri
TL;DR
This work addresses fine-grained image classification (FGIC) by fusing global texture information with local patch-based deep features. It introduces DNT, a two-stream model in which patches from a base CNN are encoded by an LSTM and complemented by multi-scale LBP texture histograms, with both streams fused for classification. Empirical results across eight diverse FGIC datasets and four backbones show accuracy gains and validate the contributions of patch encoding, texture descriptors, and random region erasing augmentation. The approach offers a practical, robust way to improve fine-grained visual recognition by combining complementary cues from deep representations and texture patterns.
Abstract
Fine-grained image classification (FGIC) is a challenging task in computer vision due to small visual differences between subcategories and large intra-class variations. Deep learning methods have achieved remarkable success in solving FGIC. In this paper, we propose a fusion approach to address FGIC by combining global texture with local patch-based information. The first pipeline extracts deep features from multiple fixed-size non-overlapping patches and encodes them by sequential modelling with a long short-term memory (LSTM) network. The other path computes image-level textures at multiple scales using local binary patterns (LBP). The advantages of both streams are integrated into an efficient feature vector for image classification. The method is tested on eight datasets representing human faces, skin lesions, food dishes, marine life, etc., using four standard backbone CNNs. Our method attains better classification accuracy than existing methods by notable margins.
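
To make the texture stream concrete, the following is a minimal sketch of an image-level, multi-scale LBP histogram. It is an illustration under stated assumptions, not the authors' implementation: the function names `lbp_codes` and `multiscale_lbp_histogram`, the radii (1, 2, 3), the 8-neighbour sampling on integer offsets, and the 256-bin histograms are all hypothetical choices, since the abstract does not specify the LBP configuration.

```python
import numpy as np


def lbp_codes(gray, radius=1):
    """Basic 8-neighbour LBP codes for a 2-D grayscale array.

    Assumption: neighbours are sampled at integer offsets on a square of the
    given radius (no circular interpolation, no uniform-pattern mapping).
    """
    gray = np.asarray(gray, dtype=np.float64)
    r = radius
    h, w = gray.shape
    center = gray[r:h - r, r:w - r]
    # Eight neighbours, clockwise from the top-left corner.
    offsets = [(-r, -r), (-r, 0), (-r, r), (0, r),
               (r, r), (r, 0), (r, -r), (0, -r)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[r + dy:h - r + dy, r + dx:w - r + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes


def multiscale_lbp_histogram(gray, radii=(1, 2, 3)):
    """Concatenate one normalised 256-bin LBP histogram per radius."""
    feats = []
    for r in radii:
        codes = lbp_codes(gray, r)
        hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
        feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)


# Usage on a synthetic image (stand-in for a real grayscale input).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))
texture = multiscale_lbp_histogram(image)
print(texture.shape)  # (768,)
```

In a two-stream design of the kind described above, a vector like `texture` would be combined with the LSTM-encoded patch features before the classifier; the abstract does not state the exact fusion operation.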
