LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification

Shawn Liang, Sahil Shah, Chengwei Zhou, SP Sharan, Harsh Goel, Arnab Sanyal, Sandeep Chinchali, Gourav Datta

TL;DR

LE-NeuS is a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while reducing the latency gap relative to base VLM prompting from 90x to approximately 10x and maintaining >10% accuracy gains on temporally complex queries.

Abstract

Neuro-symbolic approaches to long-form video question answering (LVQA) have demonstrated significant accuracy improvements by grounding temporal reasoning in formal verification. However, existing methods incur prohibitive latency overheads, up to 90x slower than base VLM prompting, rendering them impractical for latency-sensitive edge deployments. We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while drastically reducing inference latency. Our key insight is that the dominant computational bottleneck arises from sequential and dense proposition detection across video frames during automaton construction. We address this through two principled optimizations: (1) CLIP-guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretically, we derive latency bounds as a function of video length, proposition complexity, and sampling density, establishing conditions under which latency efficiency is achievable. Empirically, on LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining >10% accuracy gains on temporally complex queries.
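The two-stage adaptive sampling idea can be illustrated with a minimal sketch. It assumes frame and query embeddings have already been computed (for example by a CLIP encoder) and uses small toy vectors in their place. The function name `select_keyframes` and the thresholds `tau_rel` and `tau_red` are hypothetical choices for illustration, not values or procedures taken from the paper.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def select_keyframes(frame_emb, query_emb, tau_rel=0.25, tau_red=0.9):
    """Two-stage selection over precomputed embeddings (hypothetical thresholds).

    Stage 1 keeps frames whose cosine similarity to the query is >= tau_rel.
    Stage 2 walks the candidates in temporal order and keeps a frame only if
    its similarity to the last kept keyframe is below tau_red.
    """
    frames = l2_normalize(np.asarray(frame_emb, dtype=float))
    query = l2_normalize(np.asarray(query_emb, dtype=float)[None, :])[0]

    # Stage 1: semantic relevance filter.
    relevance = frames @ query
    candidates = [int(t) for t in np.flatnonzero(relevance >= tau_rel)]

    # Stage 2: sequential redundancy filter.
    keyframes = []
    for t in candidates:
        if not keyframes or float(frames[t] @ frames[keyframes[-1]]) < tau_red:
            keyframes.append(t)
    return candidates, keyframes


# Toy 3-D vectors standing in for real embeddings of six frames.
frame_emb = np.array([
    [1.0, 0.0, 0.0],    # frame 0: relevant
    [0.99, 0.05, 0.0],  # frame 1: near-duplicate of frame 0
    [0.0, 0.0, 1.0],    # frame 2: irrelevant background
    [0.6, 0.8, 0.0],    # frame 3: relevant and visually distinct
    [0.6, 0.79, 0.0],   # frame 4: near-duplicate of frame 3
    [0.0, 0.0, 1.0],    # frame 5: irrelevant background
])
query_emb = np.array([1.0, 1.0, 0.0])

candidates, keyframes = select_keyframes(frame_emb, query_emb)
alpha = len(candidates) / len(frame_emb)  # fraction kept by stage 1
rho = len(keyframes) / len(candidates)    # fraction of candidates kept by stage 2
print(candidates, keyframes, alpha, rho)
```

In this toy run, stage 1 drops the two background frames and stage 2 drops the two near-duplicates, so only frames 0 and 3 would be passed on to proposition detection.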

Paper Structure (9 sections, 4 theorems, 2 equations, 3 figures, 3 tables, 1 algorithm)

This paper contains 9 sections, 4 theorems, 2 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

theorem 1

The end-to-end latency of LE-NeuS is bounded by where $\alpha$ is the ratio of frames retained after semantic filtering, defined as $\alpha = \frac{|\mathcal{F}_{cand}|}{T}$, $\rho$ is the keyframe retention rate, defined as the ratio of unique keyframes retained from the candidate set: $\rho = \frac{|\mathcal{K}|}{|\mathcal{F}_{cand}|}$, and $\m
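As a small worked step, the two ratios named in the theorem can be multiplied to give the fraction of the original $T$ frames that survive both filtering stages. This follows only from the definitions of $\alpha$ and $\rho$ and is not a reconstruction of the bound itself:

$$\alpha \rho = \frac{|\mathcal{F}_{cand}|}{T} \cdot \frac{|\mathcal{K}|}{|\mathcal{F}_{cand}|} = \frac{|\mathcal{K}|}{T}, \qquad \text{so} \qquad |\mathcal{K}| = \alpha \rho \, T.$$

For illustration only, with hypothetical values $T = 1000$, $\alpha = 0.3$, and $\rho = 0.5$, this gives $|\mathcal{K}| = 150$ keyframes.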

Figures (3)

  • Figure 1: (a) Vanilla VLM Prompting uniformly samples a fixed number of frames across the entire video, irrespective of semantic relevance, which can omit critical temporal transitions and dilute reasoning with background content. (b) Heuristic Retrieval selects frames based on similarity to the query, improving relevance but lacking explicit temporal structure or formal reasoning guarantees. (c) NeuS-QA (Baseline) grounds atomic propositions over densely sampled frames, constructs a temporal logic specification, and performs formal model checking to retrieve a single continuous logic-satisfying segment; while principled and interpretable, this sequential grounding process incurs substantial latency overhead. (d) LE-NeuS (Ours) introduces CLIP-guided adaptive sampling and batched proposition detection to selectively ground propositions over semantically sparse, visually distinct keyframes. It retrieves multiple high-density, logic-consistent segments and performs verification with parallelized inference, achieving over an order-of-magnitude latency reduction while preserving formal reasoning guarantees and improving answer accuracy.
  • Figure 2: Video automaton for the running example: "After the man goes into the forest, finds and debarks the tree branches, what does he use them for?" Top: representative frames for key events (entering the forest, finding and debarking branches, and using them). Bottom: finite-state automaton derived from the temporal logic specification $\varphi$. Green states indicate frames where relevant atomic propositions are detected and temporal constraints remain satisfied; red states denote irrelevant frames or transitions that violate $\varphi$. Incremental verification ensures that only runs $\rho \models \varphi$ are retained, and reaching the accepting state yields a witness segment satisfying the query.
  • Figure 3: Top-1 accuracy (%) on Video-MME and MLVU. We evaluate LE-NeuS against the NeuS-QA baseline and Base VLM inference on subsets of Video-MME and MLVU. LE-NeuS consistently achieves superior performance, surpassing Base VLM and NeuS-QA across all question subsets.
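The automaton in Figure 2 can be approximated by a simplified sketch: a linear finite-state machine that advances one state each time the next expected event is detected, and reports a witness segment on reaching the accepting state. This is an illustrative stand-in, not the paper's model checker. The proposition names in `EVENTS` and the function `find_witness` are hypothetical, and per-frame detections are assumed to be given as sets of proposition names.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical proposition names for the three events in the running example.
EVENTS = ["enters_forest", "debarks_branches", "uses_branches"]


def find_witness(detections: List[Set[str]]) -> Optional[Tuple[int, int]]:
    """Scan frames in order and return (start, end) frame indices, or None.

    State i means the first i events have been observed in order; the state
    len(EVENTS) is accepting. At most one transition is taken per frame.
    `start` is the frame where the first event was detected and `end` is the
    frame where the last event was detected.
    """
    state = 0
    start = None
    for t, props in enumerate(detections):
        if EVENTS[state] in props:
            if state == 0:
                start = t
            state += 1
            if state == len(EVENTS):
                return (start, t)
    return None


in_order = [set(), {"enters_forest"}, set(), {"debarks_branches"}, {"uses_branches"}]
out_of_order = [{"uses_branches"}, {"debarks_branches"}, {"enters_forest"}]

print(find_witness(in_order))      # events occur in the required order
print(find_witness(out_of_order))  # order violated, so no accepting run
```

Frames that carry no relevant proposition leave the state unchanged, which mirrors the idea that irrelevant frames do not contribute to a satisfying run.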

Theorems & Definitions (10)

  • definition 1: NeuS Pipeline Latency
  • definition 2: Sequential Automaton Construction Latency
  • definition 3: Batched Proposition-Window Pair Evaluation
  • definition 4: Semantic Relevancy Score
  • definition 5: Visual Redundancy Score
  • definition 6: Sequential Adaptive Keyframe Selection
  • theorem 1: LE-NeuS Latency Bound
  • theorem 2: Condition for Latency Efficiency
  • proposition 1: Speedup over Baseline
  • proposition 2: Speedup over Baseline
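The batching idea behind definition 3 can be sketched as follows, assuming only what the abstract states: proposition detection is grouped so that one inference call covers several (proposition, window) pairs. The names `detect_batched` and `stub_detector`, the batch size, and the dummy scores are hypothetical; a real system would replace the stub with a batched VLM call.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, int]  # (proposition, window index)


def detect_batched(
    propositions: Sequence[str],
    windows: Sequence[int],
    detect_fn: Callable[[List[Pair]], List[float]],
    batch_size: int,
) -> Tuple[Dict[Pair, float], int]:
    """Score every (proposition, window) pair, calling detect_fn once per batch.

    Returns the score for each pair and the number of detect_fn calls made.
    """
    pairs = list(product(propositions, windows))
    scores: Dict[Pair, float] = {}
    calls = 0
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        for pair, score in zip(batch, detect_fn(batch)):
            scores[pair] = score
        calls += 1
    return scores, calls


def stub_detector(batch: List[Pair]) -> List[float]:
    """Placeholder for a batched VLM call; returns one dummy score per pair."""
    return [1.0 if (prop == "enters_forest" and win == 0) else 0.0
            for prop, win in batch]


scores, calls = detect_batched(
    ["enters_forest", "uses_branches"], range(5), stub_detector, batch_size=4
)
print(len(scores), calls)
```

With 2 propositions and 5 windows there are 10 pairs, so a batch size of 4 needs 3 detector calls, compared with 10 calls if each pair were evaluated sequentially.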