Towards Neuro-Symbolic Video Understanding
Minkyu Choi, Harsh Goel, Mohammad Omama, Yunhao Yang, Sahil Shah, Sandeep Chinchali
TL;DR
The paper tackles long-horizon video understanding, where state-of-the-art video-language models fail to reason robustly about temporal relationships across frames. It introduces NSVS-TL, a neuro-symbolic framework that separates neural perception, which extracts per-frame semantic propositions, from temporal logic (TL) reasoning, which is implemented with a probabilistic automaton and formal verification. Key contributions include a four-step methodology (calibration, frame validation, dynamic automaton construction, model checking), a formal TL-based pipeline verified with a probabilistic model checker, and the Temporal Logic Video (TLV) datasets comprising synthetic and real driving data. Empirically, TL-based reasoning improves F1 for complex event identification on Waymo and NuScenes by 9-15%, outperforms LLM-based reasoning on temporally extended queries, and remains stable as video length increases, which makes it practical for scalable long-horizon video retrieval.
Abstract
The unprecedented surge in video data production in recent years necessitates efficient tools to extract meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, like VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT4 for reasoning on state-of-the-art self-driving datasets such as Waymo and NuScenes.
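To illustrate how a state machine can carry memory across frames while perception stays per-frame, the following is a minimal sketch and not the authors' implementation. It assumes that a vision-language model has already produced a calibrated confidence for each proposition in each frame, that detections are independent across frames, and that the query is the single fixed pattern "proposition a holds in some frame and proposition b holds in a strictly later frame". The function name `prob_a_then_b` and the proposition labels are hypothetical, and the hand-written forward pass stands in for the probabilistic model checker mentioned in the summary.

```python
from typing import Dict, List


def prob_a_then_b(frames: List[Dict[str, float]], a: str, b: str) -> float:
    """Probability that `a` is seen in some frame and `b` in a strictly later one.

    `frames` holds one dict per frame mapping a proposition name to an
    (assumed calibrated) detection confidence in [0, 1]. Missing
    propositions are treated as confidence 0.
    """
    # Probability mass over three automaton states:
    #   index 0: still waiting for `a`
    #   index 1: `a` has been seen, waiting for `b`
    #   index 2: accepting (absorbing), `a` then `b` observed
    dist = [1.0, 0.0, 0.0]
    for conf in frames:
        pa = conf.get(a, 0.0)
        pb = conf.get(b, 0.0)
        waiting, seen_a, accepted = dist
        dist = [
            waiting * (1.0 - pa),               # `a` not detected in this frame
            waiting * pa + seen_a * (1.0 - pb),  # `a` detected now, or `b` still missing
            seen_a * pb + accepted,             # `b` detected after `a`, or already accepted
        ]
    return dist[2]


if __name__ == "__main__":
    # Hypothetical per-frame confidences for two propositions.
    video = [
        {"car": 0.9, "pedestrian": 0.1},
        {"car": 0.2, "pedestrian": 0.8},
    ]
    print(prob_a_then_b(video, "car", "pedestrian"))
```

The only state kept between frames is the three-element distribution, so the cost per frame is constant regardless of video length. A general TL query would need an automaton built from the formula rather than this hard-coded one.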
