Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation

Changsheng Lv, Zijian Fu, Mengshi Qi

TL;DR

Robo-SGG addresses robustness in scene graph generation under image corruptions by leveraging global layout information. It introduces two modules: Layout-Oriented Normalization and Restitution (NRM) to stabilize feature maps via Instance Normalization and layout-aware restitution, and Layout-Embedded Encoder (LEE) to adaptively fuse spatial and visual cues through gating. The approach is plug-and-play and improves robustness across multiple baselines, achieving state-of-the-art results on corruption benchmarks VG-C and GQA-C with favorable efficiency. This work offers a practical solution to domain shift in SGG, emphasizing structural feature stability over purely appearance-based cues.

Abstract

In this paper, we propose Robo-SGG, a plug-and-play module for robust scene graph generation (SGG). Unlike standard SGG, robust SGG aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between clean and corrupted images. Existing SGG methods suffer from degraded performance due to shifted visual features (e.g., corruption interference or occlusions). To obtain robust visual features, we leverage layout information, which represents the global structure of an image and is robust to domain shift, to enhance the robustness of SGG methods under corruption. Specifically, we employ Instance Normalization (IN) to alleviate domain-specific variations and recover robust structural features (i.e., the positional and semantic relationships among objects) with the proposed Layout-Oriented Restitution. Furthermore, to handle corrupted images, we introduce a Layout-Embedded Encoder (LEE) that adaptively fuses layout and visual features via a gating mechanism, enhancing the robustness of positional and semantic representations for objects and predicates. Robo-SGG is designed as a plug-and-play component that can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that integrating Robo-SGG into the state-of-the-art method yields relative improvements of 6.3%, 11.1%, and 8.0% in mR@50 for the PredCls, SGCls, and SGDet tasks on the VG-C benchmark, respectively, and achieves new state-of-the-art performance on the corrupted scene graph generation benchmarks (VG-C and GQA-C). We will release our source code and model.
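To make the role of Instance Normalization concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a toy C x H x W feature map stored as nested lists and models corruption as a hypothetical per-channel gain and offset (`gains`, `offsets`), which is far simpler than the corruptions in VG-C or GQA-C. Under that assumption, normalizing each channel over its spatial positions removes the shift almost entirely. The Layout-Oriented Restitution step, which recovers structural information discarded by the normalization, is not modeled here.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize each channel of a C x H x W feature map over its spatial positions."""
    normalized = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * inv_std for v in row] for row in channel])
    return normalized


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two feature maps."""
    return max(
        abs(x - y)
        for ch_a, ch_b in zip(a, b)
        for row_a, row_b in zip(ch_a, ch_b)
        for x, y in zip(row_a, row_b)
    )


# Toy "clean" feature map with 2 channels of size 2 x 2.
clean = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 5.0], [2.0, 1.0]]]

# Hypothetical corruption: a per-channel gain and offset (an assumption made
# for illustration; real image corruptions are not purely affine).
gains = [1.8, 0.6]
offsets = [0.7, -0.4]
corrupted = [
    [[g * v + o for v in row] for row in channel]
    for channel, g, o in zip(clean, gains, offsets)
]

gap_before = max_abs_diff(clean, corrupted)
gap_after = max_abs_diff(instance_norm(clean), instance_norm(corrupted))
print("feature gap before IN:", gap_before)
print("feature gap after IN:", gap_after)
```

The gap after normalization is not exactly zero because of the `eps` term in the denominator. IN only equalizes per-channel mean and variance, which is why the abstract pairs it with a restitution step instead of relying on normalization alone.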

Paper Structure

This paper contains 15 sections, 12 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (a) Illustration of the robust SGG task. (b) Feature domain shift between clean and corrupted image features degrades model performance. (c) Robo-SGG leverages layout information to improve the robustness of structural features and object/predicate representations.
  • Figure 2: Overall framework of our proposed Robo-SGG. Stage 1, Stage 2, and Output denote the standard SGG pipeline, with our NRM and LEE modules integrated into Stage 2. In Stage 1, only clean images are used during training, while corrupted images are used during validation and testing. The framework is illustrated with a two-stage SGG model: (a) NRM uses Instance Normalization and layout-aware attention to alleviate domain disturbances and restore robust structural features; (b) LEE fuses visual features with bounding-box coordinate embeddings via gated fusion to obtain robust object and predicate representations.
  • Figure 3: Qualitative comparisons on the PredCls task. Dashed lines: undetected predicates; solid black lines: correct predictions. Red edges: errors made by HiKER-SGG [zhang2024hiker]; green edges: our correct predictions.
  • Figure 4: Comparison of state-of-the-art SGG methods with and without our Robo-SGG.
  • Figure 5: Visualization of corrupted image and its feature maps. The regions of "woman", "bag", and "bus" are highlighted with boxes. "GT" denotes Ground Truth.
  • ...and 1 more figure
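The gated fusion named in the Figure 2 caption (LEE, panel b) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's formulation: the gate is taken to be an element-wise sigmoid over a linear map of the concatenated visual and layout vectors, the layout embedding is simply the box coordinates normalized by the image size, and the names `box_to_layout`, `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical. A learned model would use a trained, higher-dimensional embedding.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def box_to_layout(box, image_w, image_h):
    """Hypothetical layout embedding: (x1, y1, x2, y2) scaled to [0, 1]."""
    x1, y1, x2, y2 = box
    return [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]


def gated_fusion(visual, layout, gate_weights, gate_bias):
    """Blend visual and layout vectors with a per-dimension gate in (0, 1).

    Assumed form: g = sigmoid(W [visual; layout] + b),
    fused = g * visual + (1 - g) * layout.
    """
    concat = visual + layout  # list concatenation: [visual; layout]
    fused = []
    for i, (v, l) in enumerate(zip(visual, layout)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(logit)
        fused.append(g * v + (1.0 - g) * l)
    return fused


# Toy inputs: a 4-dimensional visual feature and one bounding box.
visual = [0.9, -0.3, 0.4, 1.2]
layout = box_to_layout((32, 48, 160, 240), image_w=320, image_h=480)

dim = len(visual)
zero_weights = [[0.0] * (2 * dim) for _ in range(dim)]

# A zero gate gives g = 0.5, i.e. an equal blend of both cues.
balanced = gated_fusion(visual, layout, zero_weights, [0.0] * dim)
# A strongly negative bias closes the gate, so the layout cue dominates.
layout_dominant = gated_fusion(visual, layout, zero_weights, [-50.0] * dim)
# A strongly positive bias opens the gate, so the visual cue dominates.
visual_dominant = gated_fusion(visual, layout, zero_weights, [50.0] * dim)

print("balanced:", balanced)
print("layout-dominant:", layout_dominant)
print("visual-dominant:", visual_dominant)
```

The design intuition is that when corruption makes the visual vector unreliable, a learned gate can shift weight toward the layout cue, which depends only on box geometry.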