Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification

Artur Barros, Carlos Caetano, João Macedo, Jefersson A. dos Santos, Sandra Avila

TL;DR

This paper tackles indoor scene classification and CSAI classification under privacy constraints by introducing ASGRA, a framework that operates on Scene Graphs rather than raw images. It generates structured scene graphs with Pix2Grp, extracts joint node and edge features, and uses a Graph Attention Network to perform graph-level classification with explainable attention. ASGRA achieves 81.27% balanced accuracy on Places8 and, in a collaboration with law enforcement on CSAI data, reaches 74.27% balanced accuracy and 76.55% recall on RCPD, while preserving privacy by avoiding image features. The work highlights the value of graph-based representations for indoor-scene and CSAI analysis, offers explainability through attention analysis, and points to open-vocabulary scene graph enhancements as a future direction.

Abstract

Indoor scene classification is a critical task in computer vision, with applications ranging from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene's components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at https://github.com/tutuzeraa/ASGRA.
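
The reported metric, balanced accuracy, is the mean of per-class recall, so a majority class cannot dominate the score. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the class names and labels are made up for illustration and are not taken from Places8 or RCPD.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (illustration only).
y_true = ["bedroom"] * 8 + ["bathroom"] * 2
y_pred = ["bedroom"] * 8 + ["bedroom", "bathroom"]

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain_acc:.2f} balanced_accuracy={bal_acc:.2f}")
```

In this toy case plain accuracy is 0.90 while balanced accuracy is 0.75, because the minority class is recalled only half the time.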

Paper Structure

This paper contains 7 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The ASGRA framework processes input images through a pre-trained scene graph generation (SGG) model to produce structured graph representations. Detected objects and their bounding boxes become node features, while relations form edge features. A GAT with attention pooling performs learning and inference, and a multilayer perceptron (MLP) predicts the indoor scene category.
  • Figure 2: Confusion matrices on the Places8 test split.
  • Figure 3: Qualitative results of ASGRA on Places8. Column (a) shows correctly classified scenes, while column (b) shows misclassifications. Each image includes its scene graph with GATv2 attention scores for nodes and edges. These visualizations showcase the model's ability to identify key semantic components for correct predictions and provide transparent analysis of failure cases, including confusion between similar scenes.
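
The pipeline described in Figure 1 (node and edge features, graph attention, attention pooling) can be sketched in plain Python. This is a rough illustration under stated assumptions, not the ASGRA implementation: it uses a single-head GATv2-style score, a^T LeakyReLU(W_src h_src + W_dst h_dst + W_edge e), and every name, weight, feature value, and the toy graph itself is hypothetical. Nodes with no incoming edge receive a zero vector here; practical implementations usually add self-loops, and a trained MLP would follow the pooled vector.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return [v if v > 0 else slope * v for v in x]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gatv2_layer(h, edges, edge_feat, W_src, W_dst, W_edge, a):
    """One single-head GATv2-style layer (illustrative sketch).

    h: node feature vectors; edges: (src, dst) pairs, messages flow src -> dst;
    edge_feat: one feature vector per edge. Returns (new_h, alpha), where
    alpha[k] is the attention weight assigned to edges[k].
    """
    src_proj = [matvec(W_src, x) for x in h]
    dst_proj = [matvec(W_dst, x) for x in h]
    scores = []
    for (s, d), e in zip(edges, edge_feat):
        z = [p + q + r for p, q, r in zip(src_proj[s], dst_proj[d], matvec(W_edge, e))]
        scores.append(sum(ai * zi for ai, zi in zip(a, leaky_relu(z))))
    alpha = [0.0] * len(edges)
    new_h = [[0.0] * len(a) for _ in h]
    for d in range(len(h)):
        incoming = [k for k, (_, dst) in enumerate(edges) if dst == d]
        if not incoming:
            continue  # no neighbours: embedding stays zero in this sketch
        # Normalise scores over the incoming edges of node d.
        for k, w in zip(incoming, softmax([scores[k] for k in incoming])):
            alpha[k] = w
            for j, v in enumerate(src_proj[edges[k][0]]):
                new_h[d][j] += w * v
    return new_h, alpha


def attention_pool(h, gate):
    """Graph-level readout: softmax-weighted sum of node embeddings."""
    weights = softmax([sum(g * v for g, v in zip(gate, x)) for x in h])
    pooled = [sum(w * x[j] for w, x in zip(weights, h)) for j in range(len(h[0]))]
    return pooled, weights


# Hypothetical scene graph: pillow -on-> bed, lamp -near-> bed.
nodes = ["bed", "pillow", "lamp"]
h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
edges = [(1, 0), (2, 0)]
edge_feat = [[1.0, 0.0], [0.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights stand in for learned parameters
new_h, alpha = gatv2_layer(h, edges, edge_feat, I2, I2, I2, a=[1.0, -1.0])
pooled, node_w = attention_pool(new_h, gate=[1.0, 1.0])

for (s, d), w in zip(edges, alpha):
    print(f"{nodes[s]} -> {nodes[d]}: attention {w:.3f}")
```

The per-edge weights in `alpha` and the per-node weights in `node_w` are the kind of quantities that attention visualizations such as those in Figure 3 display.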