LazyVLM: Neuro-Symbolic Approach to Video Analytics

Xiangru Jian, Wei Pang, Zhengyuan Dong, Chao Zhang, M. Tamer Özsu

TL;DR

LazyVLM tackles open-domain video moment retrieval at scale by replacing monolithic end-to-end VLM inference with a neuro-symbolic pipeline built on semi-structured subject-predicate-object (SPO) queries and precomputed scene graphs. It decomposes each query into semantic search, relational (symbolic) verification, and lightweight VLM refinement. This decomposition enables parallel execution and substantially reduces costly video-scale processing relative to traditional VLMs, whose context complexity is $O(n^2)$. Key contributions include an SPO-based query interface, Entity and Relationship Stores with frame-level embeddings, and a four-stage query processing pipeline (Entity Matching, SQL Generation, Relationship Matching, Temporal Matching) that supports incremental updates. The result is scalable, accurate, and user-friendly open-domain video analytics suited to real-world video data workflows.

Abstract

Current video analytics approaches face a fundamental trade-off between flexibility and efficiency. End-to-end Vision Language Models (VLMs) often struggle with long-context processing and incur high computational costs, while neuro-symbolic methods depend heavily on manual labeling and rigid rule design. In this paper, we introduce LazyVLM, a neuro-symbolic video analytics system that provides a user-friendly query interface similar to that of VLMs while addressing their scalability limitations. LazyVLM lets users drop in video data and specify complex multi-frame video queries through a semi-structured text interface. To scale, LazyVLM decomposes multi-frame video queries into fine-grained operations and offloads the bulk of the processing to efficient relational query execution and vector similarity search. We demonstrate that LazyVLM provides a robust, efficient, and user-friendly solution for querying open-domain video data at scale.
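
To make the offloading idea concrete, the following is a minimal sketch of the first two kinds of operations the abstract names: vector similarity search over entity embeddings, followed by a relational query over candidate relationships. It is not the authors' implementation. The store layout, the table name `relationships`, the toy 3-dimensional embeddings, the similarity threshold, and all function names are assumptions made for illustration.

```python
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical Entity Store rows: (entity_id, embedding).
# Real embeddings would come from a vision/text encoder; these are toy values.
ENTITY_STORE = [
    (1, [0.9, 0.1, 0.0]),  # stands in for a "man in red"
    (2, [0.0, 0.2, 0.9]),  # stands in for a second entity
    (3, [0.1, 0.9, 0.1]),  # unrelated entity
]

def match_entities(query_vec, threshold=0.8):
    """Entity matching: keep entities whose embedding is close to the query."""
    return [eid for eid, emb in ENTITY_STORE if cosine(query_vec, emb) >= threshold]

def build_demo_store():
    """Hypothetical Relationship Store as an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relationships ("
        "frame_id INTEGER, subject_id INTEGER, predicate TEXT, object_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO relationships VALUES (?, ?, ?, ?)",
        [(0, 1, "near", 2), (1, 3, "near", 2), (2, 1, "holding", 2)],
    )
    return conn

def candidate_relationships(conn, subject_ids, predicate, object_ids):
    """Relational step: fetch rows linking matched subjects and objects."""
    if not subject_ids or not object_ids:
        return []
    s_marks = ",".join("?" * len(subject_ids))
    o_marks = ",".join("?" * len(object_ids))
    sql = (
        "SELECT frame_id, subject_id, object_id FROM relationships "
        f"WHERE predicate = ? AND subject_id IN ({s_marks}) "
        f"AND object_id IN ({o_marks}) ORDER BY frame_id"
    )
    return conn.execute(sql, [predicate, *subject_ids, *object_ids]).fetchall()

conn = build_demo_store()
subjects = match_entities([1.0, 0.0, 0.0])   # query vector for the subject
objects = match_entities([0.0, 0.0, 1.0])    # query vector for the object
candidates = candidate_relationships(conn, subjects, "near", objects)
```

In this sketch only the rows in `candidates` would be handed to a VLM for verification, which is the sense in which the expensive model is invoked lazily rather than over the whole video.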

Paper Structure

This paper contains 7 sections, 2 figures.

Figures (2)

  • Figure 1: Overview of query processing in LazyVLM. The diagram illustrates the processing of a semi-structured text query, which includes entity descriptions (e.g., "man in red") and relationship terms (e.g., "near"). The query is processed through a sequence of stages: entity matching via vector similarity search, SQL query processing to retrieve candidate relationships, relationship verification using a VLM to refine results, and temporal matching to identify the final set of video segments.
  • Figure 2: Pipeline of user interactions in LazyVLM for specifying and executing a video query: Step ❶: Load Dataset and Enter Hyperparameters; Step ❷: Enter Entities; Step ❸: Enter Relationships; Step ❹: Enter Triples; Step ❺: Enter Frames and Temporal Constraints; and Step ❻: Query Execution and Presentation of Results.
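
The query a user assembles in Figure 2 (entities, relationships, triples, frames, and temporal constraints) and the final temporal matching stage in Figure 1 can be sketched as follows. This is only an illustration under stated assumptions: the class names, the `min_gap`/`max_gap` form of a temporal constraint, the example triples beyond "man in red" and "near", and the frame indices are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Triple:
    """One subject-predicate-object statement in a query frame."""
    subject: str
    predicate: str
    obj: str

@dataclass(frozen=True)
class TemporalConstraint:
    """Assumed form: `later` must occur min_gap..max_gap frames after `earlier`."""
    earlier: str
    later: str
    min_gap: int
    max_gap: int

def temporal_match(frame_hits, constraints):
    """Return every assignment of video frames to query frames that
    satisfies all temporal constraints (brute force over combinations)."""
    names = sorted(frame_hits)
    results = []
    for combo in product(*(frame_hits[n] for n in names)):
        assignment = dict(zip(names, combo))
        if all(
            c.min_gap <= assignment[c.later] - assignment[c.earlier] <= c.max_gap
            for c in constraints
        ):
            results.append(assignment)
    return results

# Hypothetical two-frame query: what each query frame must contain.
query_frames = {
    "f0": [Triple("man in red", "near", "bicycle")],
    "f1": [Triple("man in red", "riding", "bicycle")],
}

# Video frame indices that earlier stages are assumed to have verified
# for each query frame (illustrative values only).
frame_hits = {"f0": [10, 42], "f1": [15, 90]}

constraints = [TemporalConstraint("f0", "f1", min_gap=1, max_gap=10)]
matches = temporal_match(frame_hits, constraints)
```

Here only the pairing of video frame 10 with video frame 15 satisfies the constraint, so `matches` holds a single assignment. The brute-force product is chosen for clarity; a real system would likely use sorted frame lists or a join to avoid enumerating every combination.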