Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge
Bowen Jiang, Zhijun Zhuang, Shreyas S. Shivakumar, Camillo J. Taylor
TL;DR
This work tackles scene graph generation by introducing a hierarchical relation head that jointly predicts relation super-categories and the detailed relations within each super-category, improving both interpretability and performance. It further adds a commonsense validation pipeline that uses small foundation models to critique and filter predicate predictions, yielding commonsense-aligned graphs even when dataset annotations are sparse. The proposed plug-and-play modules are model-agnostic and show consistent gains on Visual Genome and OpenImage V6, including improvements in recall-based metrics and in zero-shot scenarios. The approach remains usable on local devices because the validation step can run with language-only models, and automatic clustering of the relation hierarchy supports scaling to larger datasets without manual labeling.
Abstract
This work introduces an enhanced approach to generating scene graphs by incorporating both a relationship hierarchy and commonsense knowledge. Specifically, we begin by proposing a hierarchical relation head that exploits an informative hierarchical structure. It jointly predicts the relation super-category between object pairs in an image, along with detailed relations under each super-category. Following this, we implement a robust commonsense validation pipeline that harnesses foundation models to critique the results from the scene graph prediction system, removing nonsensical predicates even with a small language-only model. Extensive experiments on the Visual Genome and OpenImage V6 datasets demonstrate that the proposed modules can be seamlessly integrated as plug-and-play enhancements to existing scene graph generation algorithms. The results show significant improvements, along with an extensive set of reasonable predictions that go beyond the dataset annotations. Code is available at https://github.com/bowen-upenn/scene_graph_commonsense.
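The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hierarchy contents, the logit values, the choice to keep one candidate per super-category, and the names `hierarchical_predict`, `filter_with_commonsense`, and `is_plausible` are all assumptions made for illustration. The sketch scores each relation by the product of its super-category probability and its conditional probability within that super-category, then passes the candidates to a critic callable that stands in for a language model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relation hierarchy, for illustration only.
HIERARCHY = {
    "geometric": ["above", "behind", "under"],
    "possessive": ["has", "part of", "wearing"],
    "semantic": ["eating", "riding", "using"],
}

def hierarchical_predict(super_logits, relation_logits):
    """Return one (joint_prob, super_category, relation) per super-category, best first.

    joint_prob = P(super_category) * P(relation | super_category).
    """
    super_probs = softmax([super_logits[name] for name in HIERARCHY])
    candidates = []
    for p_super, (name, relations) in zip(super_probs, HIERARCHY.items()):
        cond = softmax(relation_logits[name])
        best = max(range(len(relations)), key=lambda i: cond[i])
        candidates.append((p_super * cond[best], name, relations[best]))
    return sorted(candidates, reverse=True)

def filter_with_commonsense(subject, obj, candidates, is_plausible):
    """Keep only candidates that the critic accepts.

    `is_plausible(subject, predicate, obj)` is a placeholder for a
    foundation-model query; here it is any callable returning a bool.
    """
    return [c for c in candidates if is_plausible(subject, c[2], obj)]

# Toy inputs: logits for one (horse, person) object pair.
super_logits = {"geometric": 0.2, "possessive": 0.1, "semantic": 2.0}
relation_logits = {
    "geometric": [0.0, 1.0, 0.0],
    "possessive": [0.5, 0.0, 0.0],
    "semantic": [3.0, 0.5, 0.0],
}
candidates = hierarchical_predict(super_logits, relation_logits)

# Toy critic: a fixed set of triplets treated as nonsensical.
implausible = {("horse", "eating", "person")}
kept = filter_with_commonsense(
    "horse", "person", candidates,
    lambda s, p, o: (s, p, o) not in implausible,
)
```

In this toy run the highest-scoring candidate is the semantic relation "eating", which the critic rejects, so the remaining geometric and possessive candidates are kept. A real critic would replace the lookup set with a prompt to a language model.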
