Adaptive Visual Scene Understanding: Incremental Scene Graph Generation

Naitik Khandelwal; Xiao Liu; Mengmi Zhang

TL;DR

The experimental results highlight the challenges of directly combining existing continual learning methods with SGG backbones and demonstrate the effectiveness of the proposed approach, which enhances CSEGG efficiency while preserving privacy and keeping memory usage low.

Abstract

Scene graph generation (SGG) analyzes images to extract meaningful information about objects and their relationships. In the dynamic visual world, it is crucial for AI systems to continuously detect new objects and establish their relationships with existing ones. Recently, numerous studies have focused on continual learning within the domains of object detection and image recognition. However, little research addresses the more challenging problem of continual learning in SGG. This increased difficulty arises from the intricate interactions and dynamic relationships among objects and their associated contexts. Thus, in continual learning, SGG models are often required to expand, modify, retain, and reason about scene graphs within the process of adaptive visual scene understanding. To systematically explore Continual Scene Graph Generation (CSEGG), we present a comprehensive benchmark comprising three learning regimes: relationship incremental, scene incremental, and relationship generalization. Moreover, we introduce a "Replays via Analysis by Synthesis" method named RAS. This approach leverages the scene graphs, decomposes and re-composes them to represent different scenes, and replays the synthesized scenes based on these compositional scene graphs. The replayed synthesized scenes act as a means to practice and refine proficiency in SGG in known and unknown environments. Our experimental results not only highlight the challenges of directly combining existing continual learning methods with SGG backbones but also demonstrate the effectiveness of our proposed approach, which enhances CSEGG efficiency while preserving privacy and keeping memory usage low. All data and source code are publicly available online.
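To make the "decompose and re-compose" idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a scene graph is stored as labeled nodes plus (subject, predicate, object) edges, and that re-composition simply samples stored triplets and joins them into a text description. The class names `Node` and `SceneGraph`, the helpers `decompose` and `recompose`, and the output text format are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """An object in the scene: class label plus bounding box (hypothetical layout)."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max)


@dataclass
class SceneGraph:
    """Nodes are objects; each edge is (subject_index, predicate, object_index)."""
    nodes: list
    edges: list = field(default_factory=list)

    def triplets(self):
        """Parse the graph into (subject, predicate, object) label triplets."""
        return [
            (self.nodes[s].label, pred, self.nodes[o].label)
            for s, pred, o in self.edges
        ]


def decompose(graphs):
    """Collect the unique triplet labels across graphs, in first-seen order."""
    seen = {}
    for graph in graphs:
        for triplet in graph.triplets():
            seen.setdefault(triplet, None)
    return list(seen)


def recompose(triplets, k, rng):
    """Sample up to k stored triplets and join them into one scene description.

    The comma-separated text format is an assumption made for illustration.
    """
    chosen = rng.sample(triplets, min(k, len(triplets)))
    return ", ".join(f"{s} {p} {o}" for s, p, o in chosen)


if __name__ == "__main__":
    graph = SceneGraph(
        nodes=[Node("man", (0, 0, 10, 20)), Node("horse", (0, 10, 30, 40))],
        edges=[(0, "on", 1)],
    )
    stored = decompose([graph])
    print(recompose(stored, k=2, rng=random.Random(0)))
```

Storing only triplet labels rather than images is what lets a replay method of this kind avoid keeping raw training data; the sketch keeps bounding boxes on the nodes only to mirror the node contents described for scene graphs.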

Paper Structure (31 sections, 17 figures, 4 tables)

Introduction
Related Works
Continual ScenE Graph Generation Benchmark
Learning Scenarios
Competitive CSEGG Baselines
Evaluation Metrics
Replays via Analysis by Synthesis (RAS)
Results
RAS outperforms all the CSEGG baselines in Scenarios 1 and 2
CSEGG Models Can Generalize in Unknown Scenes
Ablation Studies on Our RAS Reveal Key Design Insights
Discussion
Appendix
Introduction to Three Learning Scenarios
Scenario 1 (S1): Relationship Incremental Learning
...and 16 more sections

Figures (17)

Figure 1: (a) A scene graph is a graph structure, where objects are represented as nodes (red boxes), and the relationships between objects are represented as edges connecting the corresponding nodes (green boxes). Each node in the graph contains information such as the object's class label, and spatial location. The edges in the graph indicate the relationships between objects, often described by predicates. A scene graph can be parsed into a set of triplets, consisting of three components: a subject, a relationship predicate, and an object that serves as the target or object of the relationship. The graph allows for a compact and structured representation of the objects and their relationships within a visual scene. (b) An example CSEGG application is presented, where a robot continuously encounters new objects (blue) and new relationships (yellow) over time across new scenes.
Table 1: Overview of three CSEGG learning scenarios. This table summarizes the three learning scenarios (Column 1) in CSEGG, including the number of tasks, the number of object (#Objs) and relationship (#Rels) classes, the evaluation metrics, the SGG backbones used, and the continual learning (CL) baselines. The Kn. and Unk. columns indicate what is known to the CSEGG models during training in each scenario and what is unknown and therefore learned incrementally. See the section "Continual ScenE Graph Generation Benchmark" for details.
Figure 2: Three learning scenarios are introduced. From left to right, they are S1, relationship (Rel.) incremental (Incre.) learning; S2, scene incremental learning; and S3, relationship generalization (Rel. Gen.) in object incremental learning. In S1 and S2, example triplet labels in the training (solid line) and test sets (dotted line) from each task are presented. The training and test sets from the same task share a color: blue indicates task 1 and orange indicates task 2. The new objects or relationships in each task are in bold and underlined. In S3, a single test set (dotted gray box) is used to benchmark the relationship generalization of object incremental learning models across all tasks.
Table 2: Results of CSEGG for various continual learning methods applied to the two SGG backbones (SGTR and TCNN) in Learning Scenarios 1 and 2. See the section "Competitive CSEGG Baselines" for the continual learning baselines and the section "Evaluation Metrics" for the evaluation metrics. Higher values are better for all metrics. The best results are in bold. * indicates that the experiment is still running; its results will be reported in the final version.
Figure 3: Schematic of our proposed Replays via Analysis by Synthesis (RAS) method. At task $t+1$, our RAS stores all the triplet labels $U_t$, such as <man, on, horse>, from the previous tasks. It then re-composes these triplet labels to create in-context prompts, utilizing them as inputs to generative image models to synthesize images for replays. For predicting scene graphs on these synthesized images, we employ the frozen model $M_{t}$ from the preceding task $t$, marked with "snowflakes". Subsequently, these predicted scene graph notations, along with their corresponding synthesized images, contribute to "pseudo" replays, preventing the current model $M_{t+1}$ from experiencing forgetting. See the section "Replays via Analysis by Synthesis (RAS)" for more details.
...and 12 more figures
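
The replay step described in the Figure 3 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the generative image model and the frozen model $M_t$ are passed in as plain callables (`generate_image` and `frozen_model`, both hypothetical names), and the prompt template in `build_prompt` is invented for the example.

```python
import random
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]


def build_prompt(triplets: List[Triplet]) -> str:
    """Turn triplet labels into a text prompt (template is an assumption)."""
    return "An image of " + " and ".join(f"a {s} {p} a {o}" for s, p, o in triplets)


def make_pseudo_replays(
    stored_triplets: List[Triplet],
    generate_image: Callable[[str], object],
    frozen_model: Callable[[object], List[Triplet]],
    num_images: int,
    triplets_per_prompt: int,
    seed: int = 0,
) -> List[Tuple[object, List[Triplet]]]:
    """Build (synthesized image, predicted scene graph) pairs for replay.

    stored_triplets plays the role of U_t; frozen_model stands in for M_t.
    """
    rng = random.Random(seed)
    replays = []
    for _ in range(num_images):
        # Re-compose a subset of the stored triplet labels into one prompt.
        k = min(triplets_per_prompt, len(stored_triplets))
        prompt = build_prompt(rng.sample(stored_triplets, k))
        # Synthesize an image, then label it with the frozen previous model.
        image = generate_image(prompt)
        replays.append((image, frozen_model(image)))
    return replays


if __name__ == "__main__":
    u_t = [("man", "on", "horse"), ("dog", "near", "tree")]
    # Stand-ins for a real generative model and a real SGG model.
    fake_generator = lambda prompt: {"prompt": prompt}
    fake_frozen_model = lambda image: [("man", "on", "horse")]
    for image, graph in make_pseudo_replays(u_t, fake_generator, fake_frozen_model, 2, 1):
        print(image["prompt"], graph)
```

In a training loop, the returned pairs would be mixed with the data for task $t+1$ when updating $M_{t+1}$; how they are mixed and weighted is not specified here and is left out of the sketch.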
