What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?

Xuanming Cui, Jaiminkumar Ashokbhai Bhoi, Chionh Wei Peng, Adriel Kuek, Ser Nam Lim

TL;DR

This work tackles Dynamic Scene Graph Generation (DSGG) for videos by leveraging off-the-shelf Large Multimodal Models (LMMs) with simple decoder-only architectures, showing that fine-tuning with as little as 5-10% of data yields state-of-the-art DSGG performance across SGCLS* and SGDet on Action Genome and VidVRD. It reframes DSGG as next-token prediction, grounding generated triplets with an open-vocabulary detector and introducing a Triplet Importance Prior to rank predictions by informativeness and novelty. The approach addresses long-tail predicates, mitigates the precision-recall imbalance typical of DSGG, and proposes a more realistic evaluation by incorporating both recall and precision and a ranking-based metric (nDCG). This work demonstrates that LMMs can effectively perform fine-grained, frame-wise video understanding with limited supervision, reducing annotation burden and improving the usefulness of predicted scene graphs in downstream tasks. Overall, the findings highlight a practical path to scalable DSGG with strong interpretability through triplet importance ranking and robust grounding.
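
The ranking-based metric mentioned above can be illustrated with a minimal sketch of the standard nDCG formula. This summary does not state how the paper assigns relevance to individual triplets, so the graded relevance scores below (3 for a correct, informative triplet, 1 for a correct but uninformative one, 0 for an incorrect one) are hypothetical, as are the function names. The ideal ordering is computed from the predicted list only, which is a simplifying assumption.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """nDCG@k: DCG of the top-k items divided by the DCG of the ideal ordering.

    Assumption: the ideal ordering is the same relevance scores sorted in
    descending order; ground-truth items that were never predicted are ignored.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0.0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg


# Hypothetical relevance scores for two rankings of the same three triplets.
informative_first = [3.0, 1.0, 0.0]   # most informative triplet ranked first
informative_last = [0.0, 1.0, 3.0]    # incorrect triplet ranked first

score_first = ndcg_at_k(informative_first, k=3)
score_last = ndcg_at_k(informative_last, k=3)
```

Under these assumed scores, the ranking that places the informative triplet first receives a higher nDCG than the ranking that places it last, which is the property a recall-only evaluation cannot capture.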

Abstract

Dynamic Scene Graph Generation (DSGG) for videos is a challenging task in computer vision. While existing approaches often focus on sophisticated architectural design and use only recall during evaluation, we take a closer look at their predicted scene graphs and discover three critical issues with existing DSGG methods: a severe precision-recall trade-off, a lack of awareness of triplet importance, and inappropriate evaluation protocols. Meanwhile, recent advances in Large Multimodal Models (LMMs) have shown great capabilities in video understanding, yet these models have not been tested on fine-grained, frame-wise understanding tasks like DSGG. In this work, we conduct the first systematic analysis of Video LMMs for performing DSGG. Without relying on sophisticated architectural design, we show that LMMs with a simple decoder-only structure can be turned into state-of-the-art scene graph generators that effectively overcome the aforementioned issues while requiring little finetuning (5-10% of the training data).

Paper Structure

This paper contains 47 sections, 5 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Precision-recall curves for popular DSGG methods [sttran, tempura, vrdone] and LLaVA-OneVision [llavaov] on two popular DSGG datasets, Action Genome [ji2020actiongenome] and ImageNet-VidVRD [vidvrd], with top-$K$ predictions where $K \in \{1,5,10,20,50\}$. We observe a severe precision-recall trade-off for existing methods such as STTran [sttran] on Action Genome and VrdONE [vrdone] on ImageNet-VidVRD: existing works mainly report recall at large $K \in \{10, 20, 50\}$, but their precision drops rapidly as $K$ grows, whereas the precision of LLaVA-OV remains stable.
  • Figure 2: Visualization of the top-3 predictions from STTran [sttran], TEMPURA [tempura], and LLaVA-OV [llavaov] on two sample video clips from Action Genome [ji2020actiongenome]. Red denotes incorrect predictions. Light blue denotes entities that are not in the ground truth but are semantically correct. While STTran and TEMPURA predict correct but uninformative triplets such as $\langle \textit{person}, \textit{standing\_on}, \textit{floor} \rangle$ (left), LLaVA-OV predicts $\langle \textit{person}, \textit{looking\_at}, \textit{closet} \rangle$, the primary activity of the scene.
  • Figure 3: A comparison between the existing DSGG pipeline (top) and ours (bottom). Existing DSGG methods typically employ a bottom-up approach: an external detector first extracts local features of each detected object, which are then passed to a series of sub-modules that handle object classification, relation classification, and temporal aggregation. In contrast, our approach follows a top-down procedure: the video is first passed to the pre-trained vision encoder and the language model to obtain a holistic understanding, and grounding is done after the scene graph is generated, using a pre-trained open-vocabulary object detector.
  • Figure 4: LLaVA-OV's generation before and after finetuning with 5% training data. Wrong predicates are highlighted in red.
  • Figure 5: Per-predicate-class recall and precision of LLaVA-OV (orange) and TEMPURA (blue) with $K \in \{5, 10\}$ on Action Genome under SGCLS*. LLaVA-OV is finetuned with 5% of the training data. Predicate categories are sorted from most to least frequent.
  • ...and 7 more figures
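
The precision-recall trade-off described in the Figure 1 caption can be made concrete with a small sketch of precision and recall over the top-$K$ ranked triplets of a single frame. This is an illustration under stated assumptions, not the paper's evaluation code: triplets are matched by labels only (no bounding-box overlap check, which SGDet-style evaluation would also require), the precision denominator is the number of predictions actually returned within the top $K$, and the triplets and function name are hypothetical.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def precision_recall_at_k(
    ranked: List[Triplet], ground_truth: Set[Triplet], k: int
) -> Tuple[float, float]:
    """Label-only precision and recall over the top-k ranked triplets."""
    top_k = ranked[:k]
    # Count distinct predicted triplets that appear in the ground truth.
    hits = len(set(top_k) & ground_truth)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical ranked predictions and ground truth for one frame.
ranked_predictions: List[Triplet] = [
    ("person", "looking_at", "closet"),
    ("person", "standing_on", "floor"),
    ("person", "holding", "cup"),
    ("person", "sitting_on", "chair"),
    ("person", "touching", "table"),
]
frame_ground_truth: Set[Triplet] = {
    ("person", "looking_at", "closet"),
    ("person", "standing_on", "floor"),
}

p_at_1, r_at_1 = precision_recall_at_k(ranked_predictions, frame_ground_truth, k=1)
p_at_5, r_at_5 = precision_recall_at_k(ranked_predictions, frame_ground_truth, k=5)
```

In this toy frame, moving from $K=1$ to $K=5$ raises recall from 0.5 to 1.0 while precision falls from 1.0 to 0.4, which mirrors why reporting recall alone at large $K$ can hide a large number of incorrect triplets.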