From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding
Basem Rizk, Joel Walsh, Mark Core, Benjamin Nye
TL;DR
The paper tackles the challenge of multimodal video analysis by introducing a framework that orchestrates pre-trained models into modular pipelines, converting videos into temporal semi-structured data (VideoKnowledgeBases) and then into frame-level, queryable VideoKnowledgeGraphs. The approach centers on DataWindow constructs, a flexible Pipeline framework, and a full recipe that spans transcription, keyframe extraction, OCR, object tagging, dense captioning, and scene-graph extraction, followed by knowledge graph construction with WordNet-based relations and domain extensions via VirtualSynsets. It enables continual learning and interactive knowledge extension, allowing users to add domain-specific annotations and fine-tuned mini-classifiers that enrich the graphs over time. The work demonstrates a practical path toward scalable, queryable video knowledge representations with potential applications in multimodal LLM data generation, retrieval, and AR-driven intelligent agents.
Abstract
Analyzing multimodal content can be tricky and computationally expensive, and it often requires significant engineering effort. A large body of work applies pre-trained models to static data, yet fusing these open-source models and methods with complex data such as videos remains challenging. In this paper, we present a framework for efficiently prototyping pipelines for multimodal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We then translate this structure into a frame-level indexed knowledge graph representation that is queryable and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.
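To make the flow from video to time-windowed records to a frame-level index more concrete, here is a minimal sketch. It is not the framework's actual API: the `DataWindow` class below only borrows the name of the paper's construct, and `run_pipeline`, `build_frame_index`, `frames_matching`, and the two `fake_*` stages are hypothetical stand-ins. Real stages would call pre-trained models (transcription, OCR, object tagging, captioning, scene-graph extraction) rather than return hard-coded values.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DataWindow:
    """Hypothetical time-bounded record; not the paper's actual class."""
    start: float  # window start, in seconds
    end: float    # window end, in seconds
    data: dict = field(default_factory=dict)


def run_pipeline(windows, stages):
    """Apply each stage (a function DataWindow -> dict) to each window, in order."""
    for window in windows:
        for stage in stages:
            window.data.update(stage(window))
    return windows


def build_frame_index(windows, fps):
    """Map each frame number to the (subject, relation, object) triples of its window."""
    index = defaultdict(set)
    for window in windows:
        first = int(window.start * fps)
        last = int(window.end * fps)
        for frame in range(first, last):
            index[frame].update(window.data.get("triples", []))
    return index


def frames_matching(index, subject=None, relation=None, obj=None):
    """Return sorted frame numbers holding a triple that matches the given fields."""
    hits = []
    for frame, triples in index.items():
        for s, r, o in triples:
            if subject in (None, s) and relation in (None, r) and obj in (None, o):
                hits.append(frame)
                break
    return sorted(hits)


# Stand-in stages (assumptions): hard-coded outputs in place of model calls.
def fake_object_tagger(window):
    return {"objects": ["person", "wrench"] if window.start < 2 else ["person"]}


def fake_scene_graph(window):
    triples = [("person", "holds", o) for o in window.data["objects"] if o != "person"]
    return {"triples": triples}


windows = run_pipeline(
    [DataWindow(0.0, 2.0), DataWindow(2.0, 4.0)],
    [fake_object_tagger, fake_scene_graph],
)
index = build_frame_index(windows, fps=2)
print(frames_matching(index, relation="holds", obj="wrench"))  # prints [0, 1, 2, 3]
```

Keeping stages as plain functions over a shared window record is one simple way to let models be swapped or reordered. In this sketch, adding domain-specific knowledge later would amount to appending another stage that emits extra triples.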
