Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning

Minseok Kang, Minhyeok Lee, Minjung Kim, Jungho Lee, Donghyeong Kim, Sungmin Woo, Inseok Jeon, Sangyoun Lee

Abstract

Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
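
As a rough illustration of how a pair affinity could enter inference-time ranking, the sketch below sorts candidate triplets by the product of a predicate score and a pair affinity score. The product rule, the function name `rank_pairs`, the dictionary keys, and all numeric values are assumptions made for illustration; the abstract does not specify the exact scoring formula.

```python
def rank_pairs(candidates, top_k=None):
    """Sort candidate triplets by an affinity-weighted score.

    Assumption: the final score is pa * pc, where `pc` is a predicate
    classification score and `pa` is a pair affinity score in [0, 1].
    """
    ranked = sorted(candidates, key=lambda c: c["pa"] * c["pc"], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical candidates: two detected cups, only one of which is being held.
candidates = [
    {"triplet": ("person", "holding", "cup A"), "pc": 0.6, "pa": 0.90},
    {"triplet": ("person", "holding", "cup B"), "pc": 0.8, "pa": 0.05},
]

best = rank_pairs(candidates, top_k=1)[0]
best_without_affinity = max(candidates, key=lambda c: c["pc"])
```

In this toy case the predicate score alone would favor the non-interactive pair (cup B), while the affinity-weighted score promotes the interacting pair (cup A), which is the suppression effect the abstract describes.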

Paper Structure (37 sections, 10 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Detection distribution and matching in FS-VSGG and WS-VSGG. (a) Per-frame detection statistics in the weakly annotated training set. Detections are categorized as True Match (spatial overlap with a ground-truth), False Match (class-matched without overlap), or Unmatched. (b) Detection and matching comparison for the triplet $\langle$person, holding, cup/glass/bottle$\rangle$. The fully-supervised detector produces relation-relevant proposals, whereas the off-the-shelf detector generates many irrelevant objects, leading to false matches (e.g., cup 2) under class-level matching. Our method retains only the interaction-consistent instance (cup 1).
  • Figure 2: Comparison of WS-VSGG training pipelines. $\mathcal{G}^u$: annotation of unlocalized triplets, $\mathcal{D}_t$: detected object proposals. (a) Only matched pairs $\mathcal{P}^+$ are used for training; unmatched pairs $\mathcal{P}^-$ are discarded. (b) RAM refines the matching, and both $\mathcal{P}^+$ and $\mathcal{P}^-$ are used to learn predicate classification and pair affinity jointly.
  • Figure 3: Overview of Relation-Aware Matching (RAM). For each entity in a triplet, a relation-aware query is constructed and fed into a VL grounding model. The [CLS] cross-attention map localizes the described relation, and reliability estimation determines whether to perform grounded matching via Grounding Score (GS) or fall back to class-level matching. The figure illustrates the process for $c^o$; the same procedure is applied independently to $c^s$.
  • Figure 4: Overview of PALS and PAM. (a) The relation embedding $\mathbf{R}_0$ and pair affinity embedding $\mathbf{P}_0$ are jointly updated through $L$ spatial and temporal attention blocks, then decoded into predicate classification scores $\text{PC}$ and pair affinity scores $\text{PA}$ for inference. (b) Inside each attention block, the affinity matrix $\mathbf{G}_i = \mathbf{P}_i \mathbf{P}_i^\top$ gates the attention logits, and the attention output updates both $\mathbf{R}$ and $\mathbf{P}$ via residual connections. The figure illustrates spatial attention for clarity; temporal attention follows the same mechanism with backbone-specific sequence grouping across frames.
  • Figure 5: Pair affinity score distributions on the AG test set. Negative pairs cluster near zero (left); positive pairs peak around 0.7--0.9 (right).
  • ...and 6 more figures
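
The gating described in the Figure 4 caption, where $\mathbf{G}_i = \mathbf{P}_i \mathbf{P}_i^\top$ modulates the attention logits and the attention output updates both $\mathbf{R}$ and $\mathbf{P}$ through residual connections, can be sketched in plain Python as follows. This is a minimal single-head sketch under several assumptions that the caption does not confirm: the gate is applied by elementwise multiplication, the query/key/value projections are identities, the same attention weights are used to update both embeddings, and normalization layers are omitted. The function name `affinity_gated_attention` and the toy inputs are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def affinity_gated_attention(R, P):
    """One gated attention step over N pairs (R and P are lists of N rows).

    Assumptions: multiplicative gating, identity Q/K/V projections,
    shared attention weights for the R and P updates.
    """
    d = len(R[0])
    G = matmul(P, transpose(P))          # affinity matrix G = P P^T, shape (N, N)
    logits = matmul(R, transpose(R))     # unprojected attention logits, shape (N, N)
    gated = [
        [g * l / math.sqrt(d) for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(G, logits)
    ]
    attn = softmax_rows(gated)
    out_R = matmul(attn, R)
    out_P = matmul(attn, P)
    # Residual updates for both embeddings.
    R_new = [[r + o for r, o in zip(r_row, o_row)] for r_row, o_row in zip(R, out_R)]
    P_new = [[p + o for p, o in zip(p_row, o_row)] for p_row, o_row in zip(P, out_P)]
    return R_new, P_new, attn


# Toy inputs: three candidate pairs with 2-dimensional embeddings.
R0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P0 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
R1, P1, attn = affinity_gated_attention(R0, P0)
```

Multiplicative gating is only one reading of "gates the attention logits"; an additive bias on the logits would be an equally plausible variant, and the actual formulation should be taken from the paper's equations.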