Indoor scene recognition from images under visual corruptions

Willams de Lima Costa, Raul Ismayilov, Nicola Strisciuglio, Estefania Talavera Martinez

TL;DR

This work tackles indoor scene recognition under visual corruptions by combining high-level caption-based descriptors with low-level visual features through a graph-based reasoning module and late fusion. The method builds a knowledge graph from caption words and uses a Deep GCN to derive high-level features $h$, which are combined with CNN features $l$ by late fusion to form the joint representation $z$. The authors introduce Places148-corrupted, a corruption-augmented indoor-scene benchmark, and show that multimodal fusion improves both clean accuracy and robustness to a wide range of corruptions, with detailed per-corruption and per-severity analyses. The results support the usefulness of high-level contextual information for robust classification and point toward extending this approach to additional modalities such as depth or video.
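
The late-fusion step can be illustrated with a minimal sketch. It assumes, following the architecture figure caption, that fusion is a plain concatenation of $h$ and $l$ followed by a single fully connected layer. The feature sizes, the number of classes, and the random weights below are hypothetical placeholders, not values from the paper; in the actual model $h$ would come from the graph network and $l$ from the CNN backbone.

```python
import math
import random

def late_fusion(h, l):
    """Concatenate high-level features h and low-level features l into z."""
    return list(h) + list(l)

def fully_connected(z, weights, bias):
    """Apply one linear layer, producing one logit per class."""
    return [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into class probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
h = [rng.uniform(-1, 1) for _ in range(4)]  # stand-in for the caption/graph features (size assumed)
l = [rng.uniform(-1, 1) for _ in range(6)]  # stand-in for the CNN features (size assumed)
num_classes = 3                             # hypothetical; chosen only to keep the example small

z = late_fusion(h, l)
weights = [[rng.uniform(-0.5, 0.5) for _ in z] for _ in range(num_classes)]  # untrained placeholder weights
bias = [0.0] * num_classes

probs = softmax(fully_connected(z, weights, bias))
predicted_class = max(range(num_classes), key=probs.__getitem__)
print(len(z), predicted_class)
```

Because the two streams only meet at the concatenation, either branch could in principle be swapped out without touching the other, which is the usual argument for late over early fusion.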

Abstract

The classification of indoor scenes is a critical component in various applications, such as intelligent robotics for assistive living. While deep learning has significantly advanced this field, models often suffer from reduced performance due to image corruption. This paper presents an innovative approach to indoor scene recognition that leverages multimodal data fusion, integrating caption-based semantic features with visual data to enhance both accuracy and robustness against corruption. We examine two multimodal networks that synergize visual features from CNN models with semantic captions via a Graph Convolutional Network (GCN). Our study shows that this fusion markedly improves model performance, with notable gains in Top-1 accuracy when evaluated against a corrupted subset of the Places365 dataset. Moreover, while standalone visual models displayed high accuracy on uncorrupted images, their performance deteriorated significantly with increased corruption severity. Conversely, the multimodal models demonstrated improved accuracy in clean conditions and substantial robustness to a range of image corruptions. These results highlight the efficacy of incorporating high-level contextual information through captions, suggesting a promising direction for enhancing the resilience of classification systems.

Paper Structure (15 sections, 1 equation, 8 figures, 2 tables)

This paper contains 15 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Examples of indoor scenes with high variability within Places365 [zhou2017places]. These samples share few, if any, visual features with one another.
  • Figure 2: Our proposed architecture. Given an RGB image as input, processing splits into two branches: a high-level feature extraction stream, in which we caption the image and build a large knowledge graph that is fed to a GIN-like architecture, and a low-level feature extraction stream, in which a ResNet-50 [he2015deep] model extracts visual features. Finally, we concatenate the outputs of the two streams (late fusion) and apply a fully connected (FC) layer for classification.
  • Figure 3: Examples of selected indoor classes from the Places365 dataset.
  • Figure 4: Samples of each corruption added to the dataset, at severity level $3$. The 15 corruptions are based on the work presented in [corruptions].
  • Figure 5: Precision-recall curves of our proposed networks. The area under the curve (AUC) represents the mean macro-averaged precision score. Caption GCN corresponds to $h$, the two CNN backbones correspond to $l$, and the fusion corresponds to $z$.
  • ...and 3 more figures
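
The idea of applying a corruption at a chosen severity level, as mentioned in the Figure 4 caption, can be sketched as follows. This is only an illustration under stated assumptions: Gaussian noise stands in as one familiar corruption type (the 15 types are not listed in this excerpt), the five-level severity scale and the noise scales in `SIGMA_BY_SEVERITY` are assumed values, and the function names are hypothetical rather than taken from the paper's code.

```python
import random

# Assumed noise standard deviation per severity level; illustrative values only.
SIGMA_BY_SEVERITY = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}

def gaussian_noise(image, severity=3, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image with pixels in [0, 1]."""
    if severity not in SIGMA_BY_SEVERITY:
        raise ValueError("severity must be an integer from 1 to 5")
    sigma = SIGMA_BY_SEVERITY[severity]
    rng = random.Random(seed)
    # Clip each noisy pixel back into the valid [0, 1] range.
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in image]

def mean_abs_change(a, b):
    """Average absolute per-pixel difference between two equally sized images."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

clean = [[0.5 for _ in range(8)] for _ in range(8)]  # toy 8x8 mid-gray image
mild = gaussian_noise(clean, severity=1)
severe = gaussian_noise(clean, severity=5)
print(mean_abs_change(clean, mild) < mean_abs_change(clean, severe))
```

A per-severity robustness analysis would then evaluate the same classifier on the test set once per (corruption, severity) pair and compare accuracies against the clean baseline.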