Complementary Frequency-Varying Awareness Network for Open-Set Fine-Grained Image Recognition

Qiulei Dong, Jiayin Sun, Mengyu Gao

TL;DR

The paper tackles open-set fine-grained image recognition by designing CFAN, a three-module network that captures both high- and low-frequency information through a frequency-adjustable filtering mechanism and temporal aggregation with two LSTMs. CFAN-OSFGR applies this frequency-aware feature extraction to open-set recognition and, on 3 fine-grained and 2 coarse-grained datasets, performs better than 9 state-of-the-art methods in most cases. The key contributions are a novel frequency-adjustable filter, the CFAN architecture for frequency-aware feature learning, and empirical results supported by ablations. The results suggest that balancing frequency content in features improves robustness to unknown classes and enhances discriminability in fine-grained open-set scenarios.

Abstract

Open-set image recognition is a challenging topic in computer vision. Most existing works in the literature focus on learning more discriminative features from the input images; however, they are usually insensitive to the high- or low-frequency components in features, resulting in degraded performance on fine-grained image recognition. To address this problem, we propose a Complementary Frequency-varying Awareness Network, called CFAN, that better captures both high-frequency and low-frequency information. The proposed CFAN consists of three sequential modules: (i) a feature extraction module for learning preliminary features from the input images; (ii) a frequency-varying filtering module that separates both high- and low-frequency components from the preliminary features in the frequency domain via a frequency-adjustable filter; (iii) a complementary temporal aggregation module that aggregates the high- and low-frequency components into discriminative features via two Long Short-Term Memory networks. Based on CFAN, we further propose an open-set fine-grained image recognition method, called CFAN-OSFGR, which learns image features via CFAN and classifies them via a linear classifier. Experimental results on 3 fine-grained datasets and 2 coarse-grained datasets demonstrate that CFAN-OSFGR performs significantly better than 9 state-of-the-art methods in most cases.
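To make step (ii) concrete, the following is a minimal NumPy sketch of splitting a single 2-D feature map into complementary low- and high-frequency parts in the Fourier domain. It is an illustration only, not the authors' implementation: the function name `split_frequencies` and the fixed radial `cutoff` are assumptions made here, whereas the paper uses a learned, frequency-adjustable filter.

```python
import numpy as np

def split_frequencies(feature, cutoff):
    """Split a 2-D feature map into low- and high-frequency parts.

    `cutoff` is a hypothetical fixed radius (in frequency bins) used only
    for illustration; the paper's filter is adjustable rather than fixed.
    """
    h, w = feature.shape
    # Move the zero-frequency term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = radius <= cutoff   # binary low-pass mask
    high_mask = ~low_mask         # complementary high-pass mask
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * high_mask)).real
    return low, high

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # stand-in for one channel of a feature map
low, high = split_frequencies(x, cutoff=2)
```

Because the two masks are complementary, `low + high` reconstructs `x` up to floating-point error, so no information is discarded by the split itself.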

Paper Structure (16 sections, 8 equations, 4 figures, 11 tables)


Figures (4)

  • Figure 1: Architecture of the proposed CFAN-OSFGR method. An image is first fed into the proposed CFAN to obtain a discriminative feature $\boldsymbol{X}'$. CFAN consists of three sequential modules: a feature extraction module that extracts a preliminary image feature $\boldsymbol{X}$ (ResNet50 and SwinB are each used as the backbone; SwinB is shown as an example), a frequency-varying filtering module that separates the high- and low-frequency components of the preliminary feature in the frequency domain, and a complementary temporal aggregation module that aggregates the high- and low-frequency components into a discriminative feature. The discriminative feature is then fed into a linear classifier. CFAN-OSFGR is trained by minimizing the classification loss $\mathcal{L}_{cls}$.
  • Figure 2: Sketch of the sequence of band-pass template filters $\{ \mathbf{T}_i \}_{i=1}^{N_t}$ at each channel in the frequency-varying filtering module. Each value in these filters is either 1 (in black) or 0 (in white).
  • Figure 3: OSFGR results on Aircraft for analyzing the influence of: (a) randomizing the initial adjustable vectors $\boldsymbol{p}_h^1$ and $\boldsymbol{p}_l^1$ in the frequency-varying filtering (FVF) module; (b) different numbers of template filters in the FVF module; (c) aggregating both high- and low-frequency components in the complementary temporal aggregation (CTA) module; (d) temporal aggregation in the CTA module. The three metrics, ACC, AUROC, and OSCR, are denoted as 'AC', 'AU', and 'OS', respectively. The three difficulty modes, 'Easy', 'Medium', and 'Hard', are abbreviated as 'E', 'M', and 'H', respectively.
  • Figure 4: Heatmaps on six CUB images obtained by the ResNet50 backbone (second row), ResNet50-based CFAN-OSFGR (third row), the SwinB backbone (fourth row), and SwinB-based CFAN-OSFGR (fifth row). Attention is strongest in red regions and weakens progressively through yellow, green, and blue regions.
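The binary band-pass template filters $\{ \mathbf{T}_i \}_{i=1}^{N_t}$ described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's construction: the concentric-ring shape, the equal band widths, the function names `band_pass_templates` and `combine`, and the weighted-sum combination with a vector `p` are all assumptions introduced here to show how a set of 0/1 templates could be mixed into an adjustable filter.

```python
import numpy as np

def band_pass_templates(size, num_templates):
    """Return binary ring masks that partition a size x size centred spectrum.

    Assumption: equal-width concentric rings; the paper's exact template
    layout may differ.
    """
    yy, xx = np.ogrid[:size, :size]
    radius = np.sqrt((yy - size // 2) ** 2 + (xx - size // 2) ** 2)
    edges = np.linspace(0.0, radius.max(), num_templates + 1)
    templates = []
    for i in range(num_templates):
        if i == num_templates - 1:
            upper = radius <= edges[i + 1]   # last band includes the outer edge
        else:
            upper = radius < edges[i + 1]
        templates.append(((radius >= edges[i]) & upper).astype(np.float64))
    return np.stack(templates)               # shape: (num_templates, size, size)

def combine(templates, weights):
    """Weighted sum of templates: a hypothetical stand-in for an adjustable filter."""
    return np.tensordot(weights, templates, axes=1)

T = band_pass_templates(size=8, num_templates=4)
p = np.array([1.0, 1.0, 0.0, 0.0])           # hypothetical weights: keep the two innermost bands
low_filter = combine(T, p)
high_filter = combine(T, 1.0 - p)
```

With this layout every frequency bin belongs to exactly one template, so weights `p` and `1 - p` yield complementary low- and high-frequency filters; in a learned setting the weights would be trainable rather than hard-coded.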