EnTri: Ensemble Learning with Tri-level Representations for Explainable Scene Recognition
Amirhossein Aminimehr, Amirali Molaei, Erik Cambria
TL;DR
EnTri addresses scene recognition under inter-class similarity and intra-class variation by integrating three perceptual feature levels (pixel level, segmentation level, and object-frequency level) through a stacked ensemble. Three sub-models produce the predictions $Y_L$, $Y_M$, and $Y_H$, which a weighted meta-model fuses into the final prediction. A Visual Textual Explanation Generator (VTEG) accompanies the classifier and provides interpretable visual and textual rationales. The approach reaches competitive accuracy on MIT67 ($87.69\%$), SUN397 ($75.56\%$), and UIUC8 ($99.17\%$), while improving transparency by exposing the contributing objects, locations, and textures along with confidence scores. By coupling multi-level representations with an explainability module, EnTri offers a modular framework for interpretable scene recognition, with potential applications in robotics and other visual intelligence tasks, where level-aligned explanations can support diagnostics and trust in the system.
Abstract
Scene recognition based on deep learning has made significant progress, but its performance is still limited by inter-class similarities and intra-class dissimilarities. Furthermore, prior research has focused primarily on improving classification accuracy and has paid less attention to making scene classification both precise and interpretable. We therefore propose EnTri, a scene recognition framework that applies ensemble learning to a hierarchy of visual features. EnTri represents features at three distinct levels of detail: pixel level, semantic segmentation level, and object class and frequency level. By incorporating distinct feature encoding schemes of differing complexity and leveraging ensemble strategies, our approach aims to improve classification accuracy while enhancing transparency and interpretability via visual and textual explanations. To achieve interpretability, we devised an extension algorithm that generates both visual and textual explanations highlighting the properties of a given scene that contribute to the final prediction of its category, including information about objects, statistics, spatial layout, and textural details. In experiments on benchmark scene classification datasets, EnTri achieves recognition accuracy competitive with state-of-the-art approaches: 87.69%, 75.56%, and 99.17% on the MIT67, SUN397, and UIUC8 datasets, respectively.
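
As an illustration of how the three per-level predictions ($Y_L$, $Y_M$, and $Y_H$ in the TL;DR) might be combined, the following is a minimal sketch of a weighted fusion step. It rests on assumptions that the text above does not confirm: each sub-model is taken to output a per-class probability vector, and the meta-model is simplified to a normalized weighted average. The function names `fuse_predictions` and `predict_class`, the example weights, and the example probabilities are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_predictions(
    y_l: Sequence[float],
    y_m: Sequence[float],
    y_h: Sequence[float],
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Combine three per-class probability vectors by a weighted average.

    Assumption: the meta-model is a convex combination of the pixel-level
    (y_l), segmentation-level (y_m), and object-frequency-level (y_h)
    predictions. The actual EnTri meta-model may differ.
    """
    if not (len(y_l) == len(y_m) == len(y_h)):
        raise ValueError("all prediction vectors must have the same length")
    if len(weights) != 3 or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be three non-negative values, not all zero")

    # Normalize the weights so that they sum to one.
    total = sum(weights)
    w_l, w_m, w_h = (w / total for w in weights)

    return [w_l * a + w_m * b + w_h * c for a, b, c in zip(y_l, y_m, y_h)]


def predict_class(fused: Sequence[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda i: fused[i])


if __name__ == "__main__":
    # Hypothetical three-class example; the values are illustrative only.
    y_l = [0.6, 0.3, 0.1]  # pixel-level sub-model
    y_m = [0.2, 0.7, 0.1]  # segmentation-level sub-model
    y_h = [0.1, 0.8, 0.1]  # object-frequency-level sub-model

    fused = fuse_predictions(y_l, y_m, y_h, weights=(0.5, 0.25, 0.25))
    print(fused, predict_class(fused))
```

In this toy case the pixel-level model alone would favor class 0, while the fused scores favor class 1 because the other two levels agree on it. Exposing the per-level scores next to the fused result is also what would let an explanation module attribute a decision to a particular level.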
