Detection-Fusion for Knowledge Graph Extraction from Videos
Taniya Das, Louis Mahon, Thomas Lukasiewicz
TL;DR
The paper tackles semantic video understanding by replacing natural-language captions with structured knowledge graphs extracted from videos. It introduces a two-stage deep-learning framework that first detects individuals and predicates, then fuses predictions into a knowledge graph, with an optional background-knowledge module using Visual Genome priors. The approach shows superior performance over prior video KG methods on MSVD* and MSRVTT* datasets, and ablations highlight the importance of the combining framework and the trade-off between candidate facts and runtime. By enabling language-agnostic, easily evaluable representations and exploring commonsense integration, the work advances practical video understanding and knowledge graph construction. The results suggest potential for broader applications and future extensions to richer commonsense sources and other input domains.
Abstract
One of the challenging tasks in the field of video understanding is extracting semantic content from video inputs. Most existing systems use language models to describe videos in natural language sentences, but this has several major shortcomings. Such systems can rely too heavily on the language model component and base their output on statistical regularities in natural language text rather than on the visual contents of the video. Additionally, natural language annotations cannot be readily processed by a computer, are difficult to evaluate with performance metrics and cannot be easily translated into a different natural language. In this paper, we propose a method to annotate videos with knowledge graphs, and so avoid these problems. Specifically, we propose a deep-learning-based model for this task that first predicts pairs of individuals and then the relations between them. Additionally, we propose an extension of our model for the inclusion of background knowledge in the construction of knowledge graphs.
