LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision

Jiani Huang, Ziyang Li, Mayur Naik, Ser-Nam Lim

TL;DR

LASER learns spatio-temporal scene graphs (STSGs) from videos under weak supervision by using video captions to generate formal spatio-temporal specifications. It combines a probabilistic relational database for STSGs, a Spatio-Temporal Specification Language (STSL) built on $LTL_f$, and a differentiable neuro-symbolic alignment checker to train caption-driven STSG generators within a unified framework. A multi-component loss (contrastive, temporal, semantic) guides end-to-end optimization, and an NL-to-STSL pipeline converts captions into executable specifications. Evaluations on OpenPVSG, 20BN, and MUGEN show that LASER can surpass some fully supervised baselines and learn data-efficiently, demonstrating the practical potential of weak supervision for fine-grained video semantics. The approach promises scalable STSG learning across open-domain vocabularies by leveraging foundation models and differentiable reasoning, though it remains sensitive to caption quality and long-horizon complexity.

Abstract

Supervised approaches for learning spatio-temporal scene graphs (STSG) from video are greatly hindered by their reliance on STSG-annotated videos, which are labor-intensive to construct at scale. Is it feasible to instead use readily available video captions as weak supervision? To address this question, we propose LASER, a neuro-symbolic framework to enable training STSG generators using only video captions. LASER employs large language models to first extract logical specifications with rich spatio-temporal semantic information from video captions. LASER then trains the underlying STSG generator to align the predicted STSG with the specification. The alignment algorithm overcomes the challenges of weak supervision by leveraging a differentiable symbolic reasoner and using a combination of contrastive, temporal, and semantic losses. The overall approach efficiently trains low-level perception models to extract a fine-grained STSG that conforms to the video caption. In doing so, it enables a novel methodology for learning STSGs without tedious annotations. We evaluate our method on three video datasets: OpenPVSG, 20BN, and MUGEN. Our approach demonstrates substantial improvements over fully supervised baselines, achieving a unary predicate prediction accuracy of 27.78% (+12.65%) and a binary recall@5 of 0.42 (+0.22) on OpenPVSG. Additionally, LASER exceeds baselines by 7% on 20BN and 5.2% on MUGEN in terms of overall predicate prediction accuracy.

Paper Structure (30 sections, 9 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Illustration of the learning pipeline of LASER. The goal is to fine-tune a vision-language model to produce STSGs without direct supervision from ground-truth STSG labels. LASER relies on video captions as weak-supervision labels. We apply an LLM to extract a spatio-temporal specification from video captions. The LLM-inferred relational keywords, along with the input video, are then passed to a vision-language model to generate an STSG. Finally, a spatio-temporal alignment checker uses the specification to derive an alignment loss, capturing issues in the predicted STSG. The differentiable checker back-propagates the loss to the vision-language model.
  • Figure 2: Pipeline illustration with CLIP as the backbone model for probabilistic STSG generation.
  • Figure 3: Pipeline utilizing 3-shot GPT-4 to convert natural language captions into: (1) programmatic spatio-temporal specification for alignment score calculation as input to the alignment checker, and (2) unary and binary keywords for predicting the probabilistic STSG as inputs to the neural model.
  • Figure 4: The formal syntax of STSL. Here, $\wedge$, $\vee$, and $\neg$ represent logical "and", "or", and "not". Formulas may also contain the temporal operators $\bigcirc$ (next), $\mathbf{U}$ (until), $\square$ (globally), and $\lozenge$ (finally).
  • Figure 5: Formal semantics of STSL. $\langle w, [s: t] \rangle \models \psi$ means the STSL specification $\psi$ is aligned with the STSG $w$ from time $s$ to time $t$. We use $w \models \psi$ as an abbreviation for $\langle w, [1: m] \rangle \models \psi$, where $m$ is the full video length.
  • ...and 8 more figures
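
The operators named in the Figure 4 and Figure 5 captions can be made concrete with a small checker. The sketch below is a minimal illustration and not LASER's implementation: it assumes the standard Boolean finite-trace ($LTL_f$) semantics, represents an STSG as a list of per-frame fact sets, and uses a 0-based start index with the interval always ending at the last frame. The class names (`Atom`, `Until`, ...), the function `holds`, and the example facts are all hypothetical. LASER's checker is probabilistic and differentiable, which this Boolean version does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple, Union

# Assumption: a fact is a tuple such as ("holding", "person", "cup"),
# and a frame is the set of facts that are true at that time step.
Fact = Tuple[str, ...]
Trace = List[FrozenSet[Fact]]


@dataclass(frozen=True)
class Atom:
    fact: Fact


@dataclass(frozen=True)
class Not:
    sub: "Formula"


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Next:
    sub: "Formula"


@dataclass(frozen=True)
class Until:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Finally:
    sub: "Formula"


@dataclass(frozen=True)
class Globally:
    sub: "Formula"


Formula = Union[Atom, Not, And, Or, Next, Until, Finally, Globally]


def holds(w: Trace, psi: Formula, s: int = 0) -> bool:
    """Check psi on frames w[s], ..., w[-1] under standard LTL_f semantics."""
    if not 0 <= s < len(w):
        raise ValueError("start index must point at an existing frame")
    if isinstance(psi, Atom):
        return psi.fact in w[s]
    if isinstance(psi, Not):
        return not holds(w, psi.sub, s)
    if isinstance(psi, And):
        return holds(w, psi.left, s) and holds(w, psi.right, s)
    if isinstance(psi, Or):
        return holds(w, psi.left, s) or holds(w, psi.right, s)
    if isinstance(psi, Next):
        # Strong "next": a successor frame must exist and satisfy the subformula.
        return s + 1 < len(w) and holds(w, psi.sub, s + 1)
    if isinstance(psi, Finally):
        return any(holds(w, psi.sub, j) for j in range(s, len(w)))
    if isinstance(psi, Globally):
        return all(holds(w, psi.sub, j) for j in range(s, len(w)))
    if isinstance(psi, Until):
        # Some frame j >= s satisfies right, and left holds on every frame in [s, j).
        return any(
            holds(w, psi.right, j)
            and all(holds(w, psi.left, k) for k in range(s, j))
            for j in range(s, len(w))
        )
    raise TypeError(f"unsupported formula: {psi!r}")


# Hypothetical three-frame scene graph: a person holds a cup, then the cup is on a table.
holding = ("holding", "person", "cup")
on_table = ("on", "cup", "table")
trace: Trace = [frozenset({holding}), frozenset({holding}), frozenset({on_table})]
spec = Until(Atom(holding), Atom(on_table))
print(holds(trace, spec))
```

A differentiable variant would replace the Boolean connectives with operations over per-fact probabilities; that choice is left open here because the page does not show the paper's exact rules.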