EnTri: Ensemble Learning with Tri-level Representations for Explainable Scene Recognition

Amirhossein Aminimehr, Amirali Molaei, Erik Cambria

TL;DR

EnTri addresses scene recognition under inter-class similarity and intra-class variation by integrating three levels of visual representation (pixel level, semantic-segmentation level, and object class-and-frequency level) through a stacked ensemble. Three sub-models, one per level, produce the prediction matrices $Y_L$, $Y_M$, and $Y_H$, which a meta-model fuses through a weighted combination into the final prediction. A Visual Textual Explanation Generator (VTEG) accompanies the classifier and supplies visual and textual rationales. The approach reports competitive accuracy on MIT67 ($87.69\%$), SUN397 ($75.56\%$), and UIUC8 ($99.17\%$), and improves transparency by exposing the contributing objects, locations, and textures along with confidence scores. By coupling multi-level representations with an explainability module, EnTri offers a modular framework for interpretable scene recognition with potential applications in robotics and other visual intelligence tasks. Its level-aligned explanations are intended to support model diagnostics and trust in AI systems.
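The fusion step can be illustrated with a minimal Python sketch. It follows the row-wise weighting described in the Figure 4 caption: each sub-model's prediction matrix has one row of class scores per discriminator, and each row is scaled by that discriminator's weight before the matrices are flattened and concatenated for the meta-model. How the weights are calculated is not stated in this summary, so they are plain inputs here. The function names (`weight_rows`, `flatten`, `meta_input`) are hypothetical, and the fully-connected network that consumes the result is omitted.

```python
from typing import List

Matrix = List[List[float]]  # rows = discriminators, columns = scene classes


def weight_rows(pred: Matrix, weights: List[float]) -> Matrix:
    """Scale each discriminator's row of class scores by that discriminator's weight."""
    if len(pred) != len(weights):
        raise ValueError("one weight per discriminator row is required")
    return [[w * score for score in row] for row, w in zip(pred, weights)]


def flatten(pred: Matrix) -> List[float]:
    """Flatten a prediction matrix row by row."""
    return [score for row in pred for score in row]


def meta_input(y_l: Matrix, y_m: Matrix, y_h: Matrix,
               w_l: List[float], w_m: List[float], w_h: List[float]) -> List[float]:
    """Weighted, flattened, concatenated input for the meta-model (assumed layout)."""
    return (flatten(weight_rows(y_l, w_l))
            + flatten(weight_rows(y_m, w_m))
            + flatten(weight_rows(y_h, w_h)))


# Toy example: two classes; 2 low-level, 1 mid-level, 1 high-level discriminator.
# The scores and weights below are made up for illustration.
y_l = [[0.5, 0.5], [1.0, 0.0]]
y_m = [[0.25, 0.75]]
y_h = [[0.0, 1.0]]
features = meta_input(y_l, y_m, y_h, [2.0, 0.5], [1.0], [0.5])
print(features)
```

In a full pipeline, `features` would be passed to the fully-connected meta-model that outputs the final scene category.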

Abstract

Deep-learning-based scene recognition has made significant progress, but its performance is still limited by inter-class similarities and intra-class dissimilarities. Furthermore, prior research has focused primarily on improving classification accuracy and has given less attention to interpretable, precise scene classification. We therefore propose EnTri, a scene recognition framework that applies ensemble learning to a hierarchy of visual features. EnTri represents features at three distinct levels of detail: pixel level, semantic segmentation level, and object class and frequency level. By incorporating distinct feature encoding schemes of differing complexity and leveraging ensemble strategies, our approach aims to improve classification accuracy while enhancing transparency and interpretability via visual and textual explanations. To achieve interpretability, we devised an extension algorithm that generates both visual and textual explanations highlighting the properties of a given scene that contribute to the final prediction of its category, including objects, statistics, spatial layout, and textural details. In experiments on benchmark scene classification datasets, EnTri achieves recognition accuracy competitive with state-of-the-art approaches, with accuracies of 87.69%, 75.56%, and 99.17% on the MIT67, SUN397, and UIUC8 datasets, respectively.
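As an illustration of the "object class and frequency level", the sketch below turns detector outputs into a count vector per detector and concatenates the vectors, in the spirit of the high-level representation described in the Figure 3 caption. It is a sketch under assumptions, not the authors' encoding: the vocabularies, the label lists, and the names `frequency_vector` and `high_level_representation` are hypothetical, and real detectors would also return boxes and confidence scores that are ignored here.

```python
from collections import Counter
from typing import List, Sequence


def frequency_vector(detected_labels: Sequence[str],
                     vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary class appears among the detected labels."""
    counts = Counter(detected_labels)
    return [counts[label] for label in vocabulary]


def high_level_representation(per_detector_labels: Sequence[Sequence[str]],
                              vocabularies: Sequence[Sequence[str]]) -> List[int]:
    """Concatenate one frequency vector per object detector (assumed layout)."""
    if len(per_detector_labels) != len(vocabularies):
        raise ValueError("one vocabulary per detector is required")
    vector: List[int] = []
    for labels, vocab in zip(per_detector_labels, vocabularies):
        vector.extend(frequency_vector(labels, vocab))
    return vector


# Toy example with two hypothetical detectors and small made-up vocabularies.
detections = [["chair", "chair", "table"], ["screen", "seat", "seat", "seat"]]
vocabs = [["chair", "table", "bed"], ["seat", "screen"]]
print(high_level_representation(detections, vocabs))
```

A vector of this kind is the sort of input a fully-connected discriminator could consume.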

Paper Structure

This paper contains 29 sections, 9 equations, 12 figures, 5 tables, and 3 algorithms.

Figures (12)

  • Figure 1: Demonstrations of inter-class similarity and intra-class variation. a) Images from the auditorium and movie theater classes have a high degree of similarity (inter-class similarity). b) Images of the office demonstrate a considerable degree of intra-class diversity, suggesting a wide spectrum of visual features within the category.
  • Figure 2: An example showing different levels of representation extracted from a scene using EnTri.
  • Figure 3: An overview of the proposed framework. The input scene image is first passed into the low-level, mid-level, and high-level sub-models, each producing a prediction matrix. (1) The low-level sub-model uses multiple CNN discriminators to produce the output matrix containing the softmax scores associated with each model; (2) The mid-level model generates multiple segmentation maps via semantic segmentation models, concatenates these maps to build the mid-level representation, and then employs multiple CNN discriminators to generate the prediction matrix; (3) The high-level sub-model constructs the high-level representation by concatenating the output of multiple object detectors and then passes it to the fully-connected discriminators. Finally, the meta-model combines the flattened prediction matrices with a weighted combination scheme, then passes this combination to a fully-connected network to determine the final scene category.
  • Figure 4: Weighted combination process. The calculated weight values in the constructed weight vectors are multiplied by the entire row from the sub-model matrix that matches the associated discriminator.
  • Figure 5: Building blocks of the VTEG algorithm. (a) An explainable AI method generates heatmap explanations for each image classification model, and an object detection algorithm generates an annotated detection image, which is combined with the heatmaps. The top three objects are then extracted using the rate of overlap between the heatmaps and bounding boxes. (b) The segment-based perturbation technique masks different segments of each segmentation map and calculates the importance of objects by evaluating their impact on the prediction score, helping to identify the top three objects. In cases where there are multiple identical objects, the selection process prioritizes the object with the highest score. (c) The statistical and object-based perturbation techniques help identify the top three important objects in the high-level representation and their importance frequency in the scene. Once all the attributes have been extracted, they are inserted into predetermined text formats along with the percentage of agreement between a sub-model and the prediction generated by the meta-model (the high-level section in the textual explanation includes only one object since the other objects scored zero in importance).
  • ...and 7 more figures
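
Panel (a) of Figure 5 ranks objects by the rate of overlap between explanation heatmaps and detected bounding boxes. The following Python sketch shows one way such a ranking could work. The caption does not define the overlap rate, so the sketch assumes it is the fraction of total heatmap mass that falls inside a box. Keeping only the best score among identically labeled objects is borrowed from the description of panel (b) and is also an assumption here. The names `overlap_rate` and `top_objects` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive
Heatmap = List[List[float]]      # heatmap[y][x], non-negative importance values


def overlap_rate(heatmap: Heatmap, box: Box) -> float:
    """Fraction of the total heatmap mass inside the box (assumed definition)."""
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0
    x0, y0, x1, y1 = box
    inside = sum(sum(row[x0:x1]) for row in heatmap[y0:y1])
    return inside / total


def top_objects(heatmap: Heatmap, detections: List[Tuple[str, Box]],
                k: int = 3) -> List[Tuple[str, float]]:
    """Return the k labels with the highest overlap rate, one entry per label."""
    best: Dict[str, float] = {}
    for label, box in detections:
        score = overlap_rate(heatmap, box)
        if score > best.get(label, -1.0):
            best[label] = score  # keep the best-scoring instance of each label
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy 4x4 heatmap and made-up detections for illustration.
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
detections = [
    ("screen", (1, 1, 3, 3)),
    ("seat", (0, 2, 2, 4)),
    ("seat", (2, 0, 4, 2)),
    ("lamp", (3, 3, 4, 4)),
]
print(top_objects(heatmap, detections))
```

The ranked labels could then be inserted into a predetermined text template, as the caption describes for the textual explanation.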