Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification
Artur Barros, Carlos Caetano, João Macedo, Jefersson A. dos Santos, Sandra Avila
TL;DR
This paper tackles indoor scene classification and child sexual abuse imagery (CSAI) detection under privacy constraints by introducing ASGRA, a framework that operates on scene graphs rather than raw images. It generates structured scene graphs with Pix2Grp, extracts joint node and edge features, and uses a Graph Attention Network to perform graph-level classification with explainable attention. ASGRA achieves $81.27\%$ balanced accuracy on Places8 and, in a collaboration with law enforcement on real CSAI data, reaches $74.27\%$ balanced accuracy and $76.55\%$ recall on RCPD, while preserving privacy by avoiding image features. The work highlights the value of graph-based representations for indoor scene and CSAI analysis, offers explainability through attention analysis, and points to open-vocabulary scene graph enhancements as a future direction.
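Balanced accuracy is the headline metric here, so a small worked example may help. The sketch below assumes the standard definition (the mean of per-class recall); the paper's own evaluation code is not shown in this summary, and the labels are made-up toy data rather than results from Places8 or RCPD.

```python
def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / samples of that class."""
    recalls = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls[cls] = hits / len(idx)
    return recalls


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain accuracy, for comparison."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, imbalanced labels (hypothetical): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

recalls = per_class_recall(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because the missed positive halves the recall of the minority class. That sensitivity is why the metric suits imbalanced data.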
Abstract
Indoor scene classification is a critical task in computer vision, with wide-ranging applications from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene's components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at https://github.com/tutuzeraa/ASGRA.
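To make graph-level classification with attention concrete, here is a minimal pure-Python sketch of one single-head graph attention layer followed by mean pooling and a linear classifier. It is not the ASGRA implementation. The toy scene graph, the one-hot features, the random untrained weights, and the way edge features enter the attention score (an added `a_edge` term) are all assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(nodes, edges, edge_feats, W, a_dst, a_src, a_edge):
    """One attention head. Each node attends over itself (self-loop with a
    zero edge feature) and over the sources of its incoming edges."""
    h = [matvec(W, x) for x in nodes]  # shared linear projection
    no_edge = [0.0] * len(a_edge)
    out, attention = [], []
    for i in range(len(h)):
        incoming = [(i, no_edge)]
        incoming += [(s, edge_feats[k]) for k, (s, d) in enumerate(edges) if d == i]
        scores = [
            leaky_relu(dot(a_dst, h[i]) + dot(a_src, h[j]) + dot(a_edge, e))
            for j, e in incoming
        ]
        alphas = softmax(scores)  # attention weights sum to 1 per node
        out.append([
            sum(a * h[j][c] for a, (j, _) in zip(alphas, incoming))
            for c in range(len(h[i]))
        ])
        attention.append([(j, a) for a, (j, _) in zip(alphas, incoming)])
    return out, attention


def classify_graph(node_embeddings, W_out):
    """Mean-pool node embeddings into one graph vector, then classify."""
    n = len(node_embeddings)
    pooled = [sum(col) / n for col in zip(*node_embeddings)]
    return softmax(matvec(W_out, pooled))


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical scene graph: nodes bed(0), pillow(1), lamp(2);
# edges "pillow on bed" and "lamp near bed", with one-hot relation features.
rng = random.Random(0)
nodes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edges = [(1, 0), (2, 0)]
edge_feats = [[1.0, 0.0], [0.0, 1.0]]

hidden, num_classes = 4, 2
W = random_matrix(rng, hidden, 3)
a_dst = [rng.uniform(-1, 1) for _ in range(hidden)]
a_src = [rng.uniform(-1, 1) for _ in range(hidden)]
a_edge = [rng.uniform(-1, 1) for _ in range(2)]
W_out = random_matrix(rng, num_classes, hidden)

embeddings, attention = gat_layer(nodes, edges, edge_feats, W, a_dst, a_src, a_edge)
probs = classify_graph(embeddings, W_out)
```

In this sketch, `attention[0]` lists how much the "bed" node weights itself, the pillow, and the lamp. Inspecting per-node weights of this kind is what makes an attention-based graph classifier explainable. With untrained random weights, the values themselves carry no meaning.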
