NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali

TL;DR

NeuS-QA addresses long-form video QA by grounding temporal reasoning in formal logic. It translates questions into temporal logic, builds a frame-level video automaton, and uses probabilistic model checking to identify segments that satisfy the specification before querying a vision-language model, improving interpretability and reducing hallucinations. The approach achieves over 10% gains on LVQA benchmarks such as LongVideoBench and CinePile, particularly for event ordering and multi-step reasoning, and demonstrates robustness to video duration and domain shifts. It is a training-free, plug-and-play framework with open-source code, illustrating a principled path toward scalable, interpretable temporally grounded video understanding as VLMs evolve.

Abstract

While vision-language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi-step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates a natural language question into a logic specification that models the temporal relationship between frame-level events. Next, we construct a video automaton to model the video's frame-by-frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile LVQA benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning. We open-source our code at https://utaustin-swarmlab.github.io/NeuS-QA/.
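The model-checking step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the specification is a simple ordered chain of events ("A, then B, then C"), that each frame carries an independent detection probability per proposition, and that at most one event is matched per frame. The function name `satisfaction_probability` and the proposition names are hypothetical. Under those assumptions, the probability that a segment satisfies the specification can be computed by propagating a distribution over automaton states, one frame at a time.

```python
from typing import Dict, List, Sequence


def satisfaction_probability(
    frame_probs: Sequence[Dict[str, float]],
    ordered_events: List[str],
) -> float:
    """Probability that the events occur in the given order over the frames.

    Assumptions (illustrative only): detections are independent across frames,
    and at most one event of the chain is matched per frame.
    """
    n = len(ordered_events)
    # dist[k] = probability that exactly the first k events have been matched.
    dist = [0.0] * (n + 1)
    dist[0] = 1.0
    for probs in frame_probs:
        nxt = [0.0] * (n + 1)
        nxt[n] = dist[n]  # the accepting state is absorbing
        for k in range(n):
            # Probability that the next expected event is seen in this frame.
            p = probs.get(ordered_events[k], 0.0)
            nxt[k + 1] += dist[k] * p          # advance the automaton
            nxt[k] += dist[k] * (1.0 - p)      # stay in the current state
        dist = nxt
    return dist[n]


# Hypothetical detections for a three-frame segment.
events = ["pour_water", "spoon_yogurt", "place_topping"]
in_order = [{"pour_water": 1.0}, {"spoon_yogurt": 0.5}, {"place_topping": 1.0}]
reversed_order = list(reversed(in_order))

print(satisfaction_probability(in_order, events))
print(satisfaction_probability(reversed_order, events))
```

A segment whose probability exceeds a chosen threshold would be kept and passed to the VLM. The threshold and the restriction to ordered chains are simplifications for the sketch; a full temporal logic specification would need a general model checker.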

Paper Structure

This paper contains 31 sections, 4 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of frame selection strategies to answer temporally-grounded questions over long-form videos.
  • Figure 2: Video automaton for the running example. Top: representative frames $\mathcal{F}_1$ (woman pours hot water over granola), $\mathcal{F}_{21}$ (woman spoons yogurt into bowl), and $\mathcal{F}_{43}$ (woman places topping). Bottom: finite-state automaton synthesized from the temporal logic (TL) specification $\varphi$ of the running example. Green circles indicate valid frames: those in which at least one domain-specific atomic proposition (e.g., "woman pours hot water over granola") is detected and whose inclusion preserves all ordering constraints accumulated so far. Red circles mark frames that (i) contain no relevant propositions or (ii) would force a transition contradicting the specification (for instance, "woman grabs spoon but doesn't spoon yogurt into bowl"). We verify the automaton incrementally: whenever a new frame is labelled green, we extend the current run $\rho = q_0 \rightarrow \dots \rightarrow q_t$ and immediately re-evaluate the property $\rho \models \varphi$. If the property still holds, the frame remains green; otherwise, it is re-labelled red and excluded. The procedure terminates once the accepting state $q_n$ is reached, guaranteeing that the sequence of green frames is a witness trace satisfying the specification.
  • Figure 3: Accuracy (%) gains of NeuS-QA over ground truth frame annotations. We visualize the per-model improvement of NeuS-QA in terms of absolute accuracy difference relative to ground truth annotations across different video durations. Positive values indicate that NeuS-QA outperforms ground truth frame annotations.
  • Figure 4: Qualitative Examples from the NeuS-QA pipeline. Compared to other structured reasoning frameworks, NeuS-QA more accurately identifies the correct frames of interest in a long video.
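
The incremental green/red labelling described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes the specification is an ordered chain of events, treats detections as boolean sets of proposition names, and marks a frame green only if it shows the next expected event or continues the most recently matched one. The function `label_frames` and the proposition names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def label_frames(
    frame_props: Sequence[Set[str]],
    ordered_events: List[str],
) -> Tuple[List[str], bool]:
    """Label frames green/red while advancing through an ordered event chain.

    Returns the labels of the frames visited before termination and whether
    the accepting state was reached.
    """
    n = len(ordered_events)
    state = 0  # number of events matched so far (q_0 ... q_n)
    labels: List[str] = []
    for props in frame_props:
        if state == n:
            break  # accepting state reached, the procedure terminates
        if ordered_events[state] in props:
            state += 1              # frame advances the run
            labels.append("green")
        elif state > 0 and ordered_events[state - 1] in props:
            labels.append("green")  # frame continues the current event
        else:
            # No relevant proposition, or an out-of-order one.
            labels.append("red")
    return labels, state == n


# Hypothetical per-frame detections for the running example.
events = ["pour_water", "spoon_yogurt", "place_topping"]
frames = [
    {"pour_water"},
    set(),
    {"place_topping"},   # out of order: yogurt has not been spooned yet
    {"spoon_yogurt"},
    {"place_topping"},
    {"pour_water"},      # never visited: the run is already accepted
]
labels, accepted = label_frames(frames, events)
print(labels, accepted)
```

In this sketch the green frames form the witness trace, and the final frame is never inspected because the run has already reached the accepting state.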