Self-Enhancing Video Data Management System for Compositional Events with Large Language Models [Technical Report]

Enhao Zhang; Nicole Sullivan; Brandon Haynes; Ranjay Krishna; Magdalena Balazinska

TL;DR

VOCAL-UDF answers compositional video queries without requiring a fixed set of predefined modules: it is a self-enhancing video database management system (VDBMS) that uses large language models to generate missing user-defined functions (UDFs) on demand. It supports both program-based UDFs (Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). Candidate UDFs are validated via syntax and semantic checks, and active learning selects the candidate that best matches user intent. The approach significantly improves F1 scores across three video datasets while reducing reliance on expensive, monolithic LLM queries, which makes video analytics more cost-effective and scalable. By unifying UDF definitions within a single framework and growing its module set on demand, VOCAL-UDF targets practical, domain-adaptive video querying, with potential applications in surveillance, education, and robotics.

Abstract

Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.
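To make the candidate-selection idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes objects are represented by hypothetical bounding boxes `(x1, y1, x2, y2)` in image coordinates (y grows downward), writes three hand-made candidate programs for the predicate "behind" in place of LLM-generated ones, and uses a simple disagreement heuristic to decide which examples to ask the user to label. All function names, the box format, and the heuristic are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: an object is a bounding box (x1, y1, x2, y2), y growing downward.
Box = Tuple[float, float, float, float]
Pair = Tuple[Box, Box]
UDF = Callable[[Box, Box], bool]


def _area(r: Box) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def behind_by_bottom_edge(a: Box, b: Box) -> bool:
    """Candidate 1: a is behind b if a's bottom edge is higher in the frame."""
    return a[3] < b[3]


def behind_by_size(a: Box, b: Box) -> bool:
    """Candidate 2: a is behind b if a appears smaller than b."""
    return _area(a) < _area(b)


def behind_by_both(a: Box, b: Box) -> bool:
    """Candidate 3: both cues must agree."""
    return behind_by_bottom_edge(a, b) and behind_by_size(a, b)


def f1_score(preds: List[bool], labels: List[bool]) -> float:
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def most_disputed(candidates: Dict[str, UDF], pool: List[Pair]) -> int:
    """Index of the unlabeled pair on which the candidates disagree most."""
    def disagreement(pair: Pair) -> int:
        votes = sum(udf(*pair) for udf in candidates.values())
        return min(votes, len(candidates) - votes)
    return max(range(len(pool)), key=lambda i: disagreement(pool[i]))


def select_udf(candidates: Dict[str, UDF], pool: List[Pair],
               oracle: Callable[[Pair], bool], budget: int):
    """Label up to `budget` disputed pairs via `oracle`, then rank by F1."""
    pool = list(pool)  # copy so the caller's list is not mutated
    labeled: List[Tuple[Pair, bool]] = []
    for _ in range(min(budget, len(pool))):
        pair = pool.pop(most_disputed(candidates, pool))
        labeled.append((pair, oracle(pair)))  # oracle stands in for the user
    labels = [y for _, y in labeled]
    scores = {name: f1_score([udf(*p) for p, _ in labeled], labels)
              for name, udf in candidates.items()}
    return max(scores, key=scores.get), scores


CANDIDATES: Dict[str, UDF] = {
    "behind_by_bottom_edge": behind_by_bottom_edge,
    "behind_by_size": behind_by_size,
    "behind_by_both": behind_by_both,
}
```

In this sketch, `oracle` is any callable that returns the user's label for an object pair. Asking first about the pairs where candidates disagree spends the labeling budget on the examples that actually distinguish one interpretation of the predicate from another.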

Paper Structure (31 sections, 15 figures, 11 tables, 1 algorithm)

Introduction
Background
A UDF-based data model
VOCAL-UDF approach
Query parsing and UDF proposal
Program-based UDF generation
Generating candidate programs using an LLM
Syntax verification
Distilled-model UDF generation
Image sampling
Data labeling
Model training
UDF Selection
UDF selection and active learning
Dummy UDFs
...and 16 more sections

Figures (15)

Figure 1: Given a video dataset and a user query in natural language, VOCAL-UDF (1) parses the query into a DSL notation. If the query contains predicates that existing UDFs cannot answer, VOCAL-UDF (2) automatically builds new UDFs, (3) updates its available UDF list, (4) reparses the query, and (5) executes the query to return matching video segments. VOCAL-UDF supports both program-based UDFs (i.e., Python functions) and distilled-model UDFs (i.e., ML models) for diverse concepts.
Figure 2: VOCAL-UDF system overview.
Figure 3: Program candidates, using "behind" as an example.
Figure 4: Data labeling by a VLM, using "behind" as an example.
Figure 5: End-to-end performance (F1 score, precision, recall) of generated queries with varying numbers of missing UDFs.
...and 10 more figures
