Down-Sampling Inter-Layer Adapter for Parameter and Computation Efficient Ultra-Fine-Grained Image Recognition

Edwin Arkel Rios; Femiloye Oyerinde; Min-Chun Hu; Bo-Cheng Lai

TL;DR

This work introduces a novel approach employing down-sampling inter-layer adapters in a parameter-efficient setting, where the backbone parameters are frozen and only a small set of additional modules is fine-tuned, making the method highly efficient.

Abstract

Ultra-fine-grained image recognition (UFGIR) categorizes objects with extremely small differences between classes, such as distinguishing between cultivars within the same species, as opposed to the species-level classification of fine-grained image recognition (FGIR). The difficulty of this task is exacerbated by the scarcity of samples per category. To tackle these challenges, we introduce a novel approach employing down-sampling inter-layer adapters in a parameter-efficient setting, where the backbone parameters are frozen and we only fine-tune a small set of additional modules. By integrating dual-branch down-sampling, we significantly reduce the number of parameters and floating-point operations (FLOPs) required, making our method highly efficient. Comprehensive experiments on ten datasets demonstrate that our approach obtains outstanding accuracy-cost performance, highlighting its potential for practical applications in resource-constrained environments. In particular, our method increases the average accuracy by at least 6.8% compared to other methods in the parameter-efficient setting, while requiring at least 123x fewer trainable parameters than current state-of-the-art UFGIR methods and reducing FLOPs by 30% on average compared to other methods.

Paper Structure (14 sections, 3 equations, 3 figures, 3 tables)

Introduction
Related Work
Ultra Fine-Grained Image Recognition
Parameter-Efficient Transfer Learning
Method
Vision Transformer Encoder
Inter-Layer Adapter
Residual Spatial Downsampling Branch
Experiment Methodology
Results and Discussion
Comparison with State-of-the-Art
Ablation on Design of RSDS
Conclusion
Experiment Methodology

Figures (3)

Figure 1: Average top-1 accuracy (%) across all evaluated datasets vs. the number of floating-point operations (FLOPs) for different method families: methods that only fine-tune the classification head, fine-grained image recognition (FGIR) methods in the parameter-efficient setting (PEFGIR, where only the fine-grained discrimination modules are fine-tuned), and parameter-efficient transfer learning (PETL) methods. The size of each marker is proportional to the percentage of trainable parameters of the method.
Figure 2: Overview of ViT with our proposed Intermediate Layer Adapter (ILA). Trainable modules are shown in orange and frozen ones in blue. An image is embedded into tokens and forwarded through a series of transformer encoder blocks, which we divide into three groups. After each of the first two encoder groups, the sequence is passed through an ILA. After all the encoder blocks, the CLS token is forwarded through a classification head to obtain predictions. In the ILA, tokens are forwarded through two spatial downsampling (SDS) branches. In the main SDS branch (highlighted as a grey box), tokens are first downsampled channel-wise and then spatially downsampled using a 2D depth-wise convolution. The sequence is then forwarded through a BatchNorm layer, a non-linear activation, and a point-wise convolution, before being up-sampled channel-wise. To allow for residual gradient flow, we also forward the tokens through a Residual Spatial Downsampling (RSDS) branch, implemented as a 2D depth-wise convolution whose kernel is initialized with values near one. This initialization allows the RSDS to behave as a learnable identity or pooling function. Finally, the outputs of the two SDS branches are added together and forwarded to the next encoder group.
Figure 3: Centered Kernel Alignment (CKA) similarity (Kornblith et al., 2019) between the attention layers of a vanilla ViT (left) and of ours (right). Lighter colors indicate higher similarity.
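
To make the data flow described in the Figure 2 caption more concrete, the following is a minimal back-of-the-envelope sketch of how the patch-token count shrinks across the three encoder groups and how many trainable parameters one adapter might hold. Every number in it is an assumption for illustration and is not taken from the paper: a ViT-B/16-style backbone at 224x224 input (14x14 patch grid, width 768), a hypothetical bottleneck width of 64, and 3x3 depth-wise kernels with stride 2 and padding 1. The helper names `conv_out` and `ila_param_count` are likewise hypothetical, and the CLS token is ignored.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def ila_param_count(dim, bottleneck, kernel):
    """Trainable parameters of one hypothetical ILA, biases included."""
    down = dim * bottleneck + bottleneck            # channel-wise down-projection
    dw = bottleneck * kernel * kernel + bottleneck  # depth-wise conv, main SDS branch
    bn = 2 * bottleneck                             # BatchNorm scale and shift
    pw = bottleneck * bottleneck + bottleneck       # point-wise (1x1) convolution
    up = bottleneck * dim + dim                     # channel-wise up-projection
    rsds = dim * kernel * kernel + dim              # depth-wise conv, RSDS branch
    return down + dw + bn + pw + up + rsds


# Assumed configuration (illustrative only, not the paper's settings).
dim, bottleneck = 768, 64
kernel, stride, padding = 3, 2, 1
grid = 224 // 16  # side length of the patch grid

# Patch tokens seen by each of the three encoder groups; an ILA sits
# after the first and the second group and shrinks the grid.
tokens_per_group = []
for group in range(3):
    tokens_per_group.append(grid * grid)
    if group < 2:
        grid = conv_out(grid, kernel, stride, padding)

adapter_params = 2 * ila_param_count(dim, bottleneck, kernel)

print("patch tokens per encoder group:", tokens_per_group)
print("trainable parameters in two adapters:", adapter_params)
```

Under these assumed settings each adapter roughly quarters the number of patch tokens that the following frozen encoder group has to process, which is the mechanism by which spatial down-sampling can lower FLOPs while the trainable parameters stay confined to the adapters and the classification head. The actual savings depend on the kernel size, stride, and bottleneck width used by the authors, which this page does not state.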
