From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding

Basem Rizk, Joel Walsh, Mark Core, Benjamin Nye

TL;DR

The paper tackles the challenge of multimodal video analysis by introducing a framework that orchestrates pre-trained models into modular pipelines, converting videos into temporal semi-structured data (VideoKnowledgeBases) and then into frame-level, queryable VideoKnowledgeGraphs. The approach centers on DataWindow constructs, a flexible Pipeline framework, and a full recipe that spans transcription, keyframe extraction, OCR, object tagging, dense captioning, and scene-graph extraction, followed by knowledge graph construction with WordNet-based relations and domain extensions via VirtualSynsets. It enables continual learning and interactive knowledge extension, allowing users to add domain-specific annotations and fine-tuned mini-classifiers that enrich the graphs over time. The work demonstrates a practical path toward scalable, queryable video knowledge representations with potential applications in multimodal LLM data generation, retrieval, and AR-driven intelligent agents.

Abstract

Analysis of multi-modal content can be tricky and computationally expensive, and can require a significant amount of engineering effort. Much work exists on applying pre-trained models to static data, yet fusing these open-source models and methods with complex data such as videos remains relatively challenging. In this paper, we present a framework that enables efficient prototyping of pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We further translate this structure into a frame-level indexed knowledge graph representation that is queryable and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.
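
To make the flow described in the abstract more concrete, the following is a minimal sketch of windows of video data passing through a sequence of stages and ending up as semi-structured records. Every name in it (the `DataWindow` fields, `Stage`, `run_pipeline`, and the two stub stages) is an illustrative assumption and not the framework's actual API. The stubs stand in for the pre-trained models a real pipeline would call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class DataWindow:
    """Hypothetical container: a time span of a video plus its annotations."""
    start_s: float
    end_s: float
    transcript: str = ""
    annotations: Dict[str, List[str]] = field(default_factory=dict)


# A stage takes a window and returns it with extra annotations (assumed shape).
Stage = Callable[[DataWindow], DataWindow]


def stub_tagger(window: DataWindow) -> DataWindow:
    # Stand-in for an object-tagging model: keeps transcript words longer
    # than three characters as "tags".
    words = {w.lower() for w in window.transcript.split() if len(w) > 3}
    window.annotations["tags"] = sorted(words)
    return window


def stub_captioner(window: DataWindow) -> DataWindow:
    # Stand-in for a dense-captioning model: emits one placeholder caption.
    caption = f"segment from {window.start_s:.1f}s to {window.end_s:.1f}s"
    window.annotations["captions"] = [caption]
    return window


def run_pipeline(windows: Iterable[DataWindow], stages: List[Stage]) -> List[dict]:
    """Apply each stage in order, then emit one semi-structured record per window."""
    records = []
    for window in windows:
        for stage in stages:
            window = stage(window)
        records.append(
            {
                "start_s": window.start_s,
                "end_s": window.end_s,
                "transcript": window.transcript,
                "annotations": window.annotations,
            }
        )
    return records


# Usage: one window, two stub stages.
knowledge_base = run_pipeline(
    [DataWindow(0.0, 4.5, "A ship sails across the sea")],
    [stub_tagger, stub_captioner],
)
```

Keeping each stage as a plain function from window to window is one simple way to make stages swappable; the paper's framework may organize this differently.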

Paper Structure

This paper contains 12 sections, 4 figures.

Figures (4)

  • Figure 1: Pre-pipeline: An illustration of an example 'DataWindowGenerator'. The DataWindowGenerator in the figure accepts a video, transcribes it, and segments the video based on the transcription paragraphs. Those paragraphs are constructed greedily using coherence scores. The generator yields DataWindows that pack aligned segments of frame images with the corresponding coherent segments of the transcription.
  • Figure 2: Post-pipeline: Abstract illustration of a 'DataWindowConsumer' writing the DataWindows of a video into a semi-structured format, which we call a VideoKnowledgeBase. This format is to be utilized by downstream tasks (e.g., video type classification, information retrieval, generating knowledge graphs).
  • Figure 3: An illustration of our pipeline recipe and what each step aims to achieve by employing a combination of pre-trained models. The pipeline transforms video data into a semi-structured knowledge base, beginning with extracting keyframes from the video and then applying various computer vision techniques, such as OCR, image tagging, and dense captioning. The resulting information is then processed to extract relationships between objects and entities, which are used to construct a knowledge graph.
  • Figure 4: This figure illustrates the knowledge graph of a sample query, "a sovermenny ship in the middle of the sea", representing the concept $ship$, its learned concept $sovermenny.ship.virtual.n.01$, and their relationships. The graph showcases the hierarchical structure that is used to query against the database of VideoKnowledgeGraphs.
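
The hierarchical lookup described for Figure 4 can be sketched with a toy hypernym table. This is a minimal illustration under stated assumptions: the `HYPERNYMS` table, the `FRAME_INDEX` contents, and the function names are all hypothetical, and the actual system builds on WordNet relations, not a hand-written dictionary. Only the synset name `sovermenny.ship.virtual.n.01` is taken from the figure caption.

```python
from typing import Dict, List, Set

# Hand-written toy hierarchy (child -> parent); an assumption for illustration.
# The virtual synset is attached beneath the general "ship" concept.
HYPERNYMS: Dict[str, str] = {
    "sovermenny.ship.virtual.n.01": "ship.n.01",
    "ship.n.01": "vessel.n.02",
}

# Hypothetical frame-level index: concept -> frame numbers where it was detected.
FRAME_INDEX: Dict[str, List[int]] = {
    "ship.n.01": [10, 42],
    "sovermenny.ship.virtual.n.01": [42, 87],
    "sea.n.01": [10, 42, 87],
}


def descendants(concept: str) -> Set[str]:
    """Return the concept plus every concept that has it as a transitive hypernym."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in HYPERNYMS.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def frames_for(concept: str) -> List[int]:
    """Collect frames indexed under the concept or any of its descendants."""
    frames: Set[int] = set()
    for name in descendants(concept):
        frames.update(FRAME_INDEX.get(name, []))
    return sorted(frames)


general_hits = frames_for("ship.n.01")
specific_hits = frames_for("sovermenny.ship.virtual.n.01")
```

In this sketch, querying the general concept also returns frames tagged only with the learned virtual concept, while querying the virtual synset returns just its own frames.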