Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge

Bowen Jiang, Zhijun Zhuang, Shreyas S. Shivakumar, Camillo J. Taylor

TL;DR

This work tackles scene graph generation by introducing a hierarchical relation head that jointly predicts relation super-categories and the relations within each super-category, improving interpretability and performance. It further adds a commonsense validation pipeline that uses small foundation models to critique and filter predicate predictions, yielding commonsense-aligned graphs even when dataset annotations are sparse. The proposed plug-and-play modules are model-agnostic and demonstrate consistent gains across Visual Genome and OpenImage V6, including improvements in recall-based metrics and zero-shot scenarios. Importantly, the approach remains accessible on local devices via language-only models, and automatic relation hierarchy clustering further supports scalability to larger datasets without manual labeling.

Abstract

This work introduces an enhanced approach to generating scene graphs by incorporating both a relationship hierarchy and commonsense knowledge. Specifically, we begin by proposing a hierarchical relation head that exploits an informative hierarchical structure. It jointly predicts the relation super-category between object pairs in an image, along with detailed relations under each super-category. Following this, we implement a robust commonsense validation pipeline that harnesses foundation models to critique the results from the scene graph prediction system, removing nonsensical predicates even with a small language-only model. Extensive experiments on the Visual Genome and OpenImage V6 datasets demonstrate that the proposed modules can be seamlessly integrated as plug-and-play enhancements to existing scene graph generation algorithms. The results show significant improvements, with an extensive set of reasonable predictions beyond the dataset annotations. Code is available at https://github.com/bowen-upenn/scene_graph_commonsense.
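To make the joint prediction concrete, here is a minimal sketch of how a hierarchical relation head could score one object pair. It assumes (this is not confirmed by the text above) that the head applies a softmax over super-categories and a separate softmax within each super-category, and that a relation's score is the product of the two probabilities. The toy `HIERARCHY`, the function names, and the logits are all hypothetical.

```python
import math

# Hypothetical toy hierarchy; the real label sets come from the dataset.
HIERARCHY = {
    "geometric": ["above", "behind", "on"],
    "possessive": ["has", "wearing"],
    "semantic": ["eating", "riding"],
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_predict(super_logits, relation_logits):
    """Return one (super_category, relation, score) per super-category.

    Assumed scoring rule: score = P(super-category) * P(relation | super-category).
    `super_logits` follows the key order of HIERARCHY; `relation_logits`
    maps each super-category name to the logits of its relations.
    """
    p_super = softmax(super_logits)
    candidates = []
    for name, p_s in zip(HIERARCHY, p_super):
        p_rel = softmax(relation_logits[name])
        best = max(range(len(p_rel)), key=p_rel.__getitem__)
        candidates.append((name, HIERARCHY[name][best], p_s * p_rel[best]))
    # Most confident candidate first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)


if __name__ == "__main__":
    preds = hierarchical_predict(
        super_logits=[0.0, 0.0, 2.0],
        relation_logits={
            "geometric": [0.0, 0.0, 1.0],
            "possessive": [1.0, 0.0],
            "semantic": [0.0, 3.0],
        },
    )
    for super_category, relation, score in preds:
        print(f"{super_category:>10}  {relation:<8}  {score:.3f}")
```

Keeping one candidate per super-category means an edge can carry several predicted relations, but only from disjoint super-categories, which matches how the qualitative figure describes the output.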

Paper Structure (26 sections, 3 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Example scene graph. Each edge represents a predicted relationship under a geometric, possessive, or semantic super-category, and it can be either commonsense-aligned or violated.
  • Figure 2: Overview of HIERCOM. Its core components are the hierarchical relation head and the commonsense validation pipeline, both of which are model-agnostic plug-and-play modules suitable for integration with a variety of baseline scene graph generation models that have a conventional flat classification head. Specifically, the hierarchical relation head is designed to replace this flat layer, jointly estimating relation super-categories and more granular relations within each category. The diagram also depicts a baseline scene graph generation model: it takes an RGB image and depth maps as inputs and generates feature maps, object labels, and bounding boxes using DETR. Relation estimation for each pair of instances runs in two separate passes to account for directional relationships, first treating one instance as the subject and then the other. Subsequently, the commonsense validation pipeline leverages an LLM or VLM, which can be small, to filter out commonsense-violating predicates.
  • Figure 3: Prompt engineering across different foundation models, where "{}" is a placeholder for a triplet written as a string, such as "girl riding skateboard". For LLaMA-3-8B, which lacks vision capability, we employ three distinct prompts for each of the top $m$-$n$ predicted triplets and collect a majority vote on whether each triplet makes sense, which improves robustness. In contrast, LLaVA-1.6-7B is prompted to verify whether each of the top $m$-$n$ predicted triplets actually appears in the image. Since GPT-3.5 is much larger than LLaMA-3-8B and has better instruction-following capabilities, we use a single prompt to evaluate all the top $m$-$n$ predicates in the (sub)graph of each image and collect a list of 'Yes' or 'No' responses at once, which is more efficient.
  • Figure 4: Illustration of generated scene graphs on predicate classification. All examples are from the Visual Genome test set. The first row displays images and objects, while the second row displays the final scene graphs. The third row shows an ablation without commonsense validation. For each image, we display the 10 most confident predictions, and each edge is annotated with its relation label and super-category. An edge may have multiple predicted relationships, but they must come from disjoint super-categories. In this figure, pink edges are true positives in the dataset. Blue edges are incorrect according to our observations. Interestingly, all the black edges are predictions that we believe are reasonable but that are not annotated, so they should not be regarded as false positives.
  • Figure 5: This figure compares the histograms of the original IETrans [zhang2022fine] and IETrans integrated with the proposed hierarchical relation head and commonsense validation, denoted as IETrans+HIERCOM. The background color of each relation label corresponds to its super-category, as defined in [zellers2018neural]. While performance decreases slightly for the head classes, continued improvements are observed in the tail classes. Together with the data in Table \ref{tab:long_tail}, this shows that our proposed methods not only raise the mR@$k$ scores but also maintain a good balance between the head and tail classes while also improving the R@$k$ scores.
  • ...and 1 more figures
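
The majority-vote check that the Figure 3 caption describes for the language-only model can be sketched as follows. This is only an illustration: the three prompt templates, the function names, and the `fake_model` stand-in are hypothetical and are not the prompts or model calls used by the authors. Any callable that maps a prompt string to a text answer could be passed as `ask_model`.

```python
from collections import Counter

# Hypothetical prompt templates; "{}" is the placeholder for a triplet string.
PROMPTS = (
    "Is the relation '{}' physically plausible? Answer Yes or No.",
    "Could '{}' happen in an everyday scene? Answer Yes or No.",
    "Does '{}' make sense? Answer Yes or No.",
)


def makes_sense(triplet, ask_model, prompts=PROMPTS):
    """Ask each prompt once and keep the triplet only on a 'Yes' majority."""
    votes = Counter(
        ask_model(prompt.format(triplet)).strip().lower().startswith("yes")
        for prompt in prompts
    )
    return votes[True] > votes[False]


def filter_triplets(triplets, ask_model):
    """Drop triplets that the model judges to violate commonsense."""
    return [t for t in triplets if makes_sense(t, ask_model)]


def fake_model(prompt):
    """Stand-in for a language model, used only to make the sketch runnable."""
    if "skateboard riding girl" in prompt:
        return "No"
    # A single dissenting answer, to show the vote tolerating noise.
    if prompt.startswith("Does") and "girl riding skateboard" in prompt:
        return "No."
    return "Yes"


if __name__ == "__main__":
    candidates = ["girl riding skateboard", "skateboard riding girl"]
    print(filter_triplets(candidates, fake_model))
```

Using an odd number of prompts avoids ties, and a single inconsistent answer from a small model does not flip the decision.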