Towards Neuro-Symbolic Video Understanding

Minkyu Choi, Harsh Goel, Mohammad Omama, Yunhao Yang, Sahil Shah, Sandeep Chinchali

TL;DR

The paper tackles long-horizon video understanding, where state-of-the-art video-language models fail to reason robustly about events that unfold across many frames. It introduces NSVS-TL, a neuro-symbolic framework that separates neural perception, which detects per-frame semantic propositions, from temporal-logic-based reasoning, which is implemented with a probabilistic automaton and formal verification. Key contributions include a four-step methodology (calibration, frame validation, dynamic automaton construction, model checking), a formal TL-based pipeline verified with a probabilistic model checker, and the Temporal Logic Video (TLV) datasets comprising synthetic and real driving data. Empirical results show a 9-15% improvement in F1 for complex event identification on Waymo and NuScenes. TL-based reasoning outperforms LLM-based reasoning on temporally extended queries and remains stable as video length increases, which makes the approach practical for scalable long-horizon video retrieval.

Abstract

The unprecedented surge in video data production in recent years necessitates efficient tools to extract meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, like VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT4 for reasoning on state-of-the-art self-driving datasets such as Waymo and NuScenes.
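
The decoupling described in the abstract can be illustrated with a minimal sketch. It assumes that a perception model has already produced a set of detected propositions for each frame, and it checks an "A until B" specification over that sequence using finite-trace semantics. This is not the authors' implementation: NSVS-TL reasons over a probabilistic automaton with a model checker, whereas this sketch uses deterministic labels. The function name `first_until_match` and the sample frames are hypothetical.

```python
from typing import List, Optional, Set


def first_until_match(frames: List[Set[str]], a: str, b: str) -> Optional[int]:
    """Finite-trace check of "a UNTIL b".

    Returns the index of the first frame where `b` holds, provided `a`
    holds in every earlier frame. Returns None if `a` stops holding
    before `b` appears, or if `b` never appears.
    """
    for i, props in enumerate(frames):
        if b in props:
            return i      # b reached; a held in all earlier frames
        if a not in props:
            return None   # a was violated before b appeared
    return None           # trace ended without b


# Hypothetical per-frame detections (illustrative labels only).
frames = [
    {"man hugging woman"},
    {"man hugging woman"},
    {"man hugging woman", "kiss"},
]

print(first_until_match(frames, "man hugging woman", "kiss"))  # 2
```

Because the temporal check only reads per-frame labels, the perception model can be swapped without touching the reasoning step, which is the point of the decoupling.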

Paper Structure (33 sections, 12 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: NSVS-TL Pipeline. The high-level user query, "Find the 'I'm Flying' scene from Titanic", is first decomposed into semantically meaningful atomic propositions such as "man hugging woman", "ship on the sea", and "kiss". SOTA vision and vision-language models are then employed to annotate the existence of these atomic propositions in each video frame. Subsequently, we construct a probabilistic automaton that models the video's temporal evolution based on the list of per-frame atomic propositions detected in the video. Finally, we evaluate when and where this automaton satisfies the user's query by expressing the query in a formal specification language that incorporates temporal logic. The TL equivalent of the above query is ALWAYS ($\Box$) "man hugging woman" UNTIL ($\mathsf{U}$) "ship on the sea" UNTIL ($\mathsf{U}$) "kiss". Formal verification techniques are then applied to the automaton to retrieve scenes that satisfy the TL specification.
  • Figure 2: Comparative Performance on the Event Identification Task: Video Language Models versus NSVS-TL. The accuracy of event identification with Video Language Models (Blue/Green) drops as video length or query complexity increases. In contrast, NSVS-TL (Orange) shows consistent performance irrespective of video length or query complexity.
  • Figure 3: Sample Automaton of the Running Example. Illustrates transitions from $\mathcal{F}_1$ to $\mathcal{F}_3$. Key transitions include: $q_{1,2} \to q_{2,0}$ with a 0.62 probability for only children, $q_{1,2} \to q_{2,1}$ with a 0.14 probability for both sign and children, and $q_{1,2} \to q_{2,2}$ with a 0.24 probability for only the sign. There are only children in $\mathcal{F}_3$. Therefore, all states in $\mathcal{F}_2$ are connected to $q_3$ with 1.0 probability.
  • Figure 4: Calibrating Neural Perception Models. We empirically select the optimal false positive threshold as shown in the left figure, while the right figure illustrates mapping estimation functions optimized for each neural perception model.
  • Figure 5: Performance of NSVS-TL Across Different Video Lengths. Panel (a) demonstrates the impact of various neural perception models on scene identification performance. Panel (b) illustrates the F1 scores for scene retrieval against video length for the "A until B" temporal specification.
  • ...and 1 more figures
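
The transition probabilities quoted in the Figure 3 caption can be checked with a short sketch. It encodes only the three outgoing transitions of $q_{1,2}$ that the caption reports, confirms that they form a valid probability distribution, and sums them to get the probability that each proposition holds in $\mathcal{F}_2$. The dictionary layout and the helper names `total_probability` and `probability_of` are assumptions made for illustration, not the paper's data structures.

```python
import math

# Outgoing transitions of q_{1,2}, with probabilities taken from the
# Figure 3 caption. Each entry maps a target state to the propositions
# that hold in frame F_2 and the transition probability.
transitions_from_q12 = {
    "q_2_0": (frozenset({"children"}), 0.62),
    "q_2_1": (frozenset({"sign", "children"}), 0.14),
    "q_2_2": (frozenset({"sign"}), 0.24),
}


def total_probability(transitions):
    """Sum of all outgoing transition probabilities."""
    return sum(p for _, p in transitions.values())


def probability_of(transitions, proposition):
    """Probability that `proposition` holds in the next frame."""
    return sum(p for labels, p in transitions.values() if proposition in labels)


# The outgoing probabilities of a state must sum to 1.
assert math.isclose(total_probability(transitions_from_q12), 1.0)

print(round(probability_of(transitions_from_q12, "children"), 2))  # 0.76
print(round(probability_of(transitions_from_q12, "sign"), 2))      # 0.38
```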