LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
Jiani Huang, Ziyang Li, Mayur Naik, Ser-Nam Lim
TL;DR
LASER learns spatio-temporal scene graphs (STSGs) from videos under weak supervision by converting video captions into formal spatio-temporal specifications. It combines a probabilistic relational database that stores STSGs, a Spatio-Temporal Specification Language (STSL) built on $LTL_f$, and a differentiable neuro-symbolic alignment checker, so that STSG generators can be trained from captions within a single framework. A multi-component loss (contrastive, temporal, and semantic) guides end-to-end optimization, and an NL-to-STSL pipeline converts captions into executable specifications. Evaluations on OpenPVSG, 20BN, and MUGEN show that LASER can surpass some fully supervised baselines and learn data-efficiently, demonstrating the practical potential of weak supervision for fine-grained video semantics. By leveraging foundation models and differentiable reasoning, the approach promises scalable STSG learning over open-domain vocabularies, though it remains sensitive to caption quality and long-horizon complexity.
Abstract
Supervised approaches for learning spatio-temporal scene graphs (STSGs) from video are hindered by their reliance on STSG-annotated videos, which are labor-intensive to construct at scale. Is it feasible to instead use readily available video captions as weak supervision? To address this question, we propose LASER, a neuro-symbolic framework that enables training STSG generators using only video captions. LASER first employs large language models to extract logical specifications with rich spatio-temporal semantic information from video captions. It then trains the underlying STSG generator to align the predicted STSG with the specification. The alignment algorithm overcomes the challenges of weak supervision by leveraging a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. The overall approach efficiently trains low-level perception models to extract a fine-grained STSG that conforms to the video caption. In doing so, it enables a novel methodology for learning STSGs without tedious annotations. We evaluate our method on three video datasets: OpenPVSG, 20BN, and MUGEN. Our approach demonstrates substantial improvements over fully supervised baselines, achieving a unary predicate prediction accuracy of 27.78% (+12.65%) and a binary recall@5 of 0.42 (+0.22) on OpenPVSG. Additionally, LASER exceeds baselines by 7% on 20BN and 5.2% on MUGEN in terms of overall predicate prediction accuracy.
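To make the idea of aligning a predicted STSG with a caption-derived specification more concrete, the following is a minimal sketch, not LASER's implementation. It assumes a toy soft semantics for a fragment of finite-trace temporal logic: conjunction is a product, "eventually" is a maximum over the remaining frames, and "next" evaluates to 0 past the last frame. The predicate names, the helper functions (`atom`, `conj`, `nxt`, `eventually`, `alignment_loss`), and the probabilities are all hypothetical, and the choice of product and max is an assumption made for illustration.

```python
import math
from typing import Callable, Dict, List

# A trace holds one dict of predicate probabilities per video frame.
Trace = List[Dict[str, float]]
# A formula maps (trace, frame index) to a soft truth value in [0, 1].
Formula = Callable[[Trace, int], float]


def atom(name: str) -> Formula:
    """Probability of a predicate at frame t (0.0 if it was not predicted)."""
    return lambda trace, t: trace[t].get(name, 0.0)


def conj(f: Formula, g: Formula) -> Formula:
    """Soft conjunction as a product (an illustrative assumption)."""
    return lambda trace, t: f(trace, t) * g(trace, t)


def nxt(f: Formula) -> Formula:
    """Strong 'next' on a finite trace: 0.0 when there is no next frame."""
    return lambda trace, t: f(trace, t + 1) if t + 1 < len(trace) else 0.0


def eventually(f: Formula) -> Formula:
    """Soft 'eventually': best score over frames t, t+1, ..., end of trace."""
    return lambda trace, t: max(
        (f(trace, i) for i in range(t, len(trace))), default=0.0
    )


def alignment_loss(spec: Formula, trace: Trace, eps: float = 1e-6) -> float:
    """Negative log of the soft satisfaction score at frame 0."""
    return -math.log(max(spec(trace, 0), eps))


# Hypothetical specification for a caption such as
# "a person holds a cup and later drinks":
# eventually(holding AND next(eventually(drinking))).
spec = eventually(conj(atom("holding"), nxt(eventually(atom("drinking")))))

# Hypothetical per-frame predictions: holding comes before drinking.
aligned = [
    {"holding": 0.9, "drinking": 0.1},
    {"holding": 0.8, "drinking": 0.2},
    {"holding": 0.1, "drinking": 0.7},
]
# The same frames in reverse order, which violates the temporal ordering.
reversed_trace = list(reversed(aligned))

print(alignment_loss(spec, aligned), alignment_loss(spec, reversed_trace))
```

In this sketch the trace that matches the caption's ordering receives a lower loss than the reversed one. Because the score is built only from products and maxima, the same structure could in principle be written with tensors in an automatic-differentiation framework so that gradients reach the perception model's outputs.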
