LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision

Jiani Huang; Ziyang Li; Mayur Naik; Ser-Nam Lim

TL;DR

LASER learns spatio-temporal scene graphs (STSGs) from videos under weak supervision, using video captions to generate formal spatio-temporal specifications. It combines a probabilistic relational database for STSGs, a Spatio-Temporal Specification Language (STSL) built on $LTL_f$, and a differentiable neuro-symbolic alignment checker to train caption-driven STSG generators within a unified framework. A multi-component loss (contrastive, temporal, semantic) guides end-to-end optimization, and an NL-to-STSL pipeline converts captions into executable specifications. Evaluations on OpenPVSG, 20BN, and MUGEN show that LASER can surpass some fully supervised baselines and learn data-efficiently, demonstrating the practical potential of weak supervision for fine-grained video semantics. The approach promises scalable STSG learning across open-domain vocabularies by leveraging foundation models and differentiable reasoning, though it remains sensitive to caption quality and long-horizon complexity.

Abstract

Supervised approaches for learning spatio-temporal scene graphs (STSGs) from video are hindered by their reliance on STSG-annotated videos, which are labor-intensive to construct at scale. Is it feasible to use readily available video captions as weak supervision instead? To address this question, we propose LASER, a neuro-symbolic framework that enables training STSG generators using only video captions. LASER first employs large language models to extract logical specifications with rich spatio-temporal semantic information from video captions. It then trains the underlying STSG generator to align the predicted STSG with the specification. The alignment algorithm overcomes the challenges of weak supervision by leveraging a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. The overall approach efficiently trains low-level perception models to extract a fine-grained STSG that conforms to the video caption. In doing so, it enables a novel methodology for learning STSGs without tedious annotations. We evaluate our method on three video datasets: OpenPVSG, 20BN, and MUGEN. Our approach demonstrates substantial improvements over fully supervised baselines, achieving a unary predicate prediction accuracy of 27.78% (+12.65%) and a binary recall@5 of 0.42 (+0.22) on OpenPVSG. Additionally, LASER exceeds baselines by 7% on 20BN and 5.2% on MUGEN in terms of overall predicate prediction accuracy.
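
To make the idea of a differentiable alignment score concrete, the following is a minimal sketch, not the paper's reasoner. It assumes a single predicate with one predicted probability per frame, treats frames as independent (an assumption made only for this illustration), and scores two simple temporal specifications. The function names `soft_finally`, `soft_globally`, and `alignment_loss` are hypothetical. Plain Python floats are used here; the same arithmetic written with an automatic-differentiation library would let the loss propagate back to the perception model.

```python
import math

def soft_finally(probs):
    """Probability that the predicate holds in at least one frame,
    assuming (for illustration only) that frames are independent."""
    return 1.0 - math.prod(1.0 - p for p in probs)

def soft_globally(probs):
    """Probability that the predicate holds in every frame,
    under the same independence assumption."""
    return math.prod(probs)

def alignment_loss(score, eps=1e-9):
    """Negative log-likelihood of the specification being satisfied."""
    return -math.log(max(score, eps))

# Hypothetical per-frame probabilities for one predicate, e.g. open(door).
frame_probs = [0.1, 0.5, 0.9]

loss_if_caption_says_eventually = alignment_loss(soft_finally(frame_probs))
loss_if_caption_says_always = alignment_loss(soft_globally(frame_probs))
```

With these probabilities, a caption implying "eventually open" yields a much smaller loss than one implying "always open", which is the kind of signal a caption-derived specification can provide without frame-level labels.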

Paper Structure (30 sections, 9 equations, 13 figures, 9 tables)

Introduction
Related Work
Methodology
Video to Probabilistic Relational Database
Spatio-Temporal Specification Language (STSL)
Natural Language to Programmatic Spatio-Temporal Specification
Spatio-Temporal Alignment Checking
Loss Function
Evaluation
OpenPVSG Dataset
20BN Dataset
MUGEN Dataset
Conclusion, Limitation, and Future Outlook
Natural Language to STSL Specification
Character Setup
...and 15 more sections

Figures (13)

Figure 1: Illustration of the learning pipeline of LASER. The goal is to fine-tune a vision-language model to produce STSGs without direct supervision from ground-truth STSG labels. LASER relies on video captions as weak-supervision labels. We apply an LLM to extract a spatio-temporal specification from video captions. The LLM-inferred relational keywords, along with the input video, are then passed to a vision-language model to generate an STSG. Finally, a spatio-temporal alignment checker uses the specification to derive an alignment loss that captures issues in the predicted STSG. The differentiable checker back-propagates the loss to the vision-language model.
Figure 2: Pipeline illustration with CLIP as the backbone model for probabilistic STSG generation.
Figure 3: Pipeline utilizing 3-shot GPT-4 to convert natural language captions into: (1) programmatic spatio-temporal specification for alignment score calculation as input to the alignment checker, and (2) unary and binary keywords for predicting the probabilistic STSG as inputs to the neural model.
Figure 4: The formal syntax of STSL. Here, $\wedge$, $\vee$, and $\neg$ represent logical "and", "or", and "not". Formulas may also contain the temporal operators $\bigcirc$ (next), $\mathbf{U}$ (until), $\square$ (globally), and $\lozenge$ (finally).
Figure 5: Formal semantics of STSL. $\langle w, [s: t] \rangle \models \psi$ means that the STSL specification $\psi$ is aligned with the STSG $w$ from time $s$ to time $t$. We use $w \models \psi$ as an abbreviation for $\langle w, [1: m] \rangle \models \psi$, where $m$ is the full video length.
...and 8 more figures
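
The temporal operators named in the captions of Figures 4 and 5 can be illustrated with a small checker over a finite, non-probabilistic trace. This is a sketch under stated assumptions: it follows textbook $LTL_f$ semantics over an interval $[s:t]$ with 0-based frame indices, whereas the paper's own rules in Figure 5 are not reproduced here and may differ in detail. The tuple encoding of formulas and the function `holds` are hypothetical names introduced for this example.

```python
from typing import FrozenSet, Sequence, Tuple

Fact = Tuple[str, ...]    # e.g. ("open", "door")
Frame = FrozenSet[Fact]   # all facts true in one video frame

def holds(w: Sequence[Frame], phi: tuple, s: int, t: int) -> bool:
    """Return True if formula phi holds on frames w[s..t] (inclusive, 0-based)."""
    op = phi[0]
    if op == "atom":
        return s <= t and phi[1] in w[s]
    if op == "not":
        return not holds(w, phi[1], s, t)
    if op == "and":
        return holds(w, phi[1], s, t) and holds(w, phi[2], s, t)
    if op == "or":
        return holds(w, phi[1], s, t) or holds(w, phi[2], s, t)
    if op == "next":
        # A next frame must exist inside the interval.
        return s + 1 <= t and holds(w, phi[1], s + 1, t)
    if op == "finally":
        return any(holds(w, phi[1], k, t) for k in range(s, t + 1))
    if op == "globally":
        return all(holds(w, phi[1], k, t) for k in range(s, t + 1))
    if op == "until":
        # phi[2] holds at some frame k, and phi[1] holds at every frame before k.
        return any(
            holds(w, phi[2], k, t)
            and all(holds(w, phi[1], j, t) for j in range(s, k))
            for k in range(s, t + 1)
        )
    raise ValueError(f"unknown operator: {op!r}")

# A hypothetical three-frame scene graph trace.
closed = ("atom", ("closed", "door"))
opened = ("atom", ("open", "door"))
touching = ("atom", ("touching", "person", "door"))
trace = [
    frozenset({("closed", "door")}),
    frozenset({("closed", "door"), ("touching", "person", "door")}),
    frozenset({("open", "door")}),
]

closed_until_open = holds(trace, ("until", closed, opened), 0, len(trace) - 1)
always_closed = holds(trace, ("globally", closed), 0, len(trace) - 1)
```

On this trace, `closed_until_open` evaluates to `True` and `always_closed` to `False`. Replacing the Boolean connectives with probabilistic counterparts is one way such a check could be made differentiable, though that substitution is an illustration and not a description of the paper's reasoner.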
