LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval

Avishree Khare, Hideki Okamoto, Bardh Hoxha, Georgios Fainekos, Rajeev Alur

TL;DR

This work addresses the challenge of scoring temporal properties over sequences when local predictions are noisy. It introduces LogSTOP, a linear-time algorithm for computing STOP scores from local predictors under Linear Temporal Logic (LTL), using downsampling, smoothing, and log-space accumulation to robustly handle prediction noise. It also provides an adaptive threshold for query matching and a subsequence-based retrieval method with O(T^2 |φ|) complexity for ranking sequences by temporal relevance. Evaluations on the QMTP and TP2VR benchmarks across objects, actions, and emotions show that LogSTOP with lightweight detectors outperforms large vision-language and audio-language models, as well as existing temporal logic baselines, highlighting the potential of logic-based temporal reasoning for efficient, scalable temporal querying and retrieval. The work also points to future directions in expressive logics and multi-modal extensions to broaden applicability.

Abstract

Neural models such as YOLO and HuBERT can be used to detect local properties such as objects ("car") and emotions ("angry") in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., "does the speaker eventually sound happy in this audio clip?"), and ranked retrieval (e.g., "retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected"). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
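
To make the idea of lifting per-frame scores to a temporal property concrete, here is a minimal sketch in Python. It is not the paper's LogSTOP definition, which is not reproduced in this section and which also involves downsampling and smoothing. The sketch assumes that per-frame scores can be treated as independent probabilities, so that "always" becomes a sum of log scores and "eventually" follows from the duality with negation. The function names (`log_score`, `log_not`, `log_always`, `log_eventually`) and the clamp constant `EPS` are hypothetical and chosen only for illustration.

```python
import math

EPS = 1e-12  # assumed clamp to avoid log(0); not a value from the paper


def log_score(p):
    """Log of a local score in [0, 1], clamped away from zero."""
    return math.log(min(max(p, EPS), 1.0))


def log_not(lp):
    """Log score of the negation: log(1 - exp(lp)), clamped away from zero."""
    return math.log(max(1.0 - math.exp(lp), EPS))


def log_always(probs):
    """'Always p' under the independence assumption: sum of per-frame log scores."""
    return sum(log_score(p) for p in probs)


def log_eventually(probs):
    """'Eventually p' as not(always(not p)), still under the independence assumption."""
    return log_not(sum(log_not(log_score(p)) for p in probs))


# Three frames with a confident "person" detection in each frame.
short_clip = [0.9, 0.9, 0.9]
# Ten frames with the same per-frame confidence.
long_clip = [0.9] * 10

constant_threshold = math.log(0.5)
short_matches = log_always(short_clip) > constant_threshold
long_matches = log_always(long_clip) > constant_threshold
```

Under these assumptions, the short clip passes the constant $\log 0.5$ threshold but the longer clip does not, even though every frame has the same confidence. This illustrates why a threshold that adapts to the sequence and the query can accept more matching sequences than a constant one.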

Paper Structure

This paper contains 23 sections, 24 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: LogSTOPs for three videos with respect to the query "Is there a person in all frames of this video?". Video 2 with occluded persons is assigned a lower score than video 1 (where a person is visible in all frames), and higher score than video 3 (where there are frames with no persons). These scores can be used for ranking and query matching with the adaptive threshold we define in Section \ref{sec:logstop_qm}. YOLOv8x is used here to detect objects in individual frames of the videos.
  • Figure 2: LogSTOP outperforms other methods on the QMTP-video and QMTP-speech datasets. The average balanced accuracy for the five temporal property categories and overall is presented. Detailed results for all queries are provided in Appendix \ref{appendix:qm_results}.
  • Figure 3: The adaptive threshold accepts more matching sequences than the constant $\log 0.5$ threshold. LogSTOPs with YOLOv8 (mean with 95% CI) are shown for sequences from QMTP-video. Comparison is shown for three properties, with results for other properties in Appendix \ref{appendix:thresholds}.
  • Figure 4: LogSTOP outperforms zero-shot text-to-video retrieval methods on the TP2VR benchmark ($r$ denotes the number of relevant sequences). Detailed results are in Appendix \ref{appendix:vr_results}.
  • Figure 5: Examples of video retrieval with different methods, from the TP2VR-objects and TP2VR-actions datasets. The event length ranges in terms of number of frames are mentioned with the temporal properties. Detailed discussion of these examples is in Appendix \ref{appendix:examples}.
  • ...and 1 more figure