
Less yet robust: crucial region selection for scene recognition

Jianqi Zhang, Mengxuan Wang, Jingyao Wang, Lingyu Si, Changwen Zheng, Fanjiang Xu

TL;DR

This work implements a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix, and introduces a regularization term to enhance the significance of key high-level feature regions.

Abstract

Scene recognition, particularly for aerial and underwater images, often suffers from various types of image degradation, such as blurring or overexposure. Previous work based on convolutional neural networks has been shown to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features, so that the model can learn from these regions and avoid interference. Specifically, we implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification, and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.
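To make the masking idea concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes the mask is stored as one logit per spatial region, squashed to a weight in (0, 1) with a sigmoid, and multiplied into every channel of a C x H x W feature map. The names `apply_region_mask` and `sparsity_penalty` are hypothetical, and the mean-weight penalty is only a stand-in for the paper's regularization term, whose exact form is not given in this section. In training, the logits would be updated by gradient descent together with the network weights; that step is omitted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_region_mask(features, mask_logits):
    """Scale each spatial region of a C x H x W feature map by a weight in (0, 1).

    `mask_logits` is an H x W grid of parameters (assumed learnable elsewhere).
    """
    weights = [[sigmoid(v) for v in row] for row in mask_logits]
    masked = [
        [
            [channel[i][j] * weights[i][j] for j in range(len(weights[i]))]
            for i in range(len(weights))
        ]
        for channel in features
    ]
    return masked, weights


def sparsity_penalty(weights):
    """Hypothetical regularizer: the mean mask weight.

    Adding this to the loss would push the mask to keep fewer regions;
    it is an assumption, not the regularization term from the paper.
    """
    total = sum(sum(row) for row in weights)
    return total / (len(weights) * len(weights[0]))


# Toy feature map: 2 channels, 2 x 2 spatial regions.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
# A large positive logit keeps a region, a large negative one suppresses it.
mask_logits = [[10.0, -10.0], [0.0, 0.0]]

masked, weights = apply_region_mask(features, mask_logits)
penalty = sparsity_penalty(weights)
```

In this toy example the top-left region passes through almost unchanged, the top-right region is suppressed to nearly zero, and the bottom row is halved, which illustrates how a single H x W mask filters all channels of the high-level features at once.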

Paper Structure (14 sections, 4 equations, 6 figures, 1 table)


Figures (6)

  • Figure 1: The first row shows the input images, where the red boxes indicate the areas sufficient to determine the class of the image. The second row (obtained through Grad-CAM [Grad2017]) shows the areas that ResNet18 focuses on when predicting these images. The labels of the images and the confidence scores for these labels are shown at the bottom of the images. It is evident that the attention areas of ResNet18 are significantly larger than the regions within the red boxes.
  • Figure 2: The proposed method's overall framework. Typically, features near the input end are considered to be low-level semantic information, containing fine-grained semantics, while features near the output end are considered to be high-level semantic information, containing coarser semantics.
  • Figure 3: Example of the UGS dataset, which contains multiple common marine geology categories, e.g., sediment and rock.
  • Figure 4: Visualization of changes in the attention areas of the input images before and after incorporating the mask matrix $\mathbf{M}$ in ResNet18. The * behind the model name indicates that the model incorporates our mask matrix $\mathbf{M}$.
  • Figure 5: Comparison of model accuracy under the influence of noise. The * behind the model name indicates that the model incorporates our mask matrix $\mathbf{M}$. (a) Using ResNet-based models and Gaussian noise. (b) Using ResNet-based models and salt-and-pepper noise. (c) Using MobileNet-based models and Gaussian noise. (d) Using MobileNet-based models and salt-and-pepper noise. The experiments are conducted on the UCM dataset.
  • ...and 1 more figure