Detection-Fusion for Knowledge Graph Extraction from Videos

Taniya Das; Louis Mahon; Thomas Lukasiewicz

TL;DR

The paper tackles semantic video understanding by replacing natural-language captions with structured knowledge graphs extracted from videos. It introduces a two-stage deep-learning framework that first detects individuals and predicates, then fuses these predictions into a knowledge graph, with an optional background-knowledge module that uses Visual Genome priors. The approach outperforms prior video knowledge-graph methods on the MSVD* and MSRVTT* datasets, and ablations highlight the importance of the combining framework and the trade-off between the number of candidate facts and runtime. Because the output is language-agnostic and easy to evaluate, and because the model can integrate commonsense knowledge, the work advances practical video understanding and knowledge graph construction. The results suggest potential for broader applications and for future extensions to richer commonsense sources and other input domains.

Abstract

One of the challenging tasks in the field of video understanding is extracting semantic content from video inputs. Most existing systems use language models to describe videos in natural language sentences, but this has several major shortcomings. Such systems can rely too heavily on the language model component and base their output on statistical regularities in natural language text rather than on the visual contents of the video. Additionally, natural language annotations cannot be readily processed by a computer, are difficult to evaluate with performance metrics and cannot be easily translated into a different natural language. In this paper, we propose a method to annotate videos with knowledge graphs, and so avoid these problems. Specifically, we propose a deep-learning-based model for this task that first predicts pairs of individuals and then the relations between them. Additionally, we propose an extension of our model for the inclusion of background knowledge in the construction of knowledge graphs.
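The claim that knowledge-graph annotations are easier to evaluate than free-text captions can be illustrated with a minimal sketch. It assumes each fact is a tuple of a predicate and its arguments, and that predictions are scored by exact set overlap with the ground truth. The `Fact` type, the `f1_score` helper and the example facts are hypothetical illustrations, not the authors' data format or evaluation code.

```python
from typing import Set, Tuple

# Assumption: a fact is a tuple such as ("dog",) for an individual
# or ("play", "dog", "ball") for a relation between two individuals.
Fact = Tuple[str, ...]


def f1_score(predicted: Set[Fact], truth: Set[Fact]) -> float:
    """F1 between two fact sets, counting only exact matches."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotation for one video.
truth = {("dog",), ("ball",), ("play", "dog", "ball")}
predicted = {("dog",), ("ball",), ("chase", "dog", "ball")}

score = f1_score(predicted, truth)
print(f"F1 = {score:.3f}")
```

Because both annotations are plain sets, no language model or text-similarity metric is needed to compare them, which is the property the abstract contrasts with natural-language captions.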

Paper Structure (15 sections, 1 equation, 5 figures, 3 tables)

Introduction
Related Work
Method
Main Model
Inclusion of Background Knowledge
Implementation Details
Experimental Results
Datasets
Main Results
Inclusion of Background Knowledge
Qualitative Results
Ablation Studies
Ablation on combining framework
Effect of the number of candidate facts
Conclusion

Figures (5)

Figure 1: The first frame from MSVD*, with (1) ground-truth natural language captions in MSVD, (2) the ground-truth set of facts in MSVD*, (3) the facts predicted by our model, with (a) objects/subjects present, (b) attributes predicted, (c) relations predicted, and (4) visual representation of the knowledge graph produced.
Figure 2: Description of our approach for annotating a video input with a knowledge graph using background knowledge, as explained in the Main Model section.
Figure 3: F1-score vs the number of training epochs for the predicate-MLP, for the main model (left) and extended model (right).
Figure 4: Left: the first two frames from a video in MSVD*, with (1) ground-truth natural language captions in MSVD, (2) the ground-truth set of facts in MSVD*, (3) the facts predicted by the proposed model. Right: the first two frames from a video in MSRVTT*, with (1) ground-truth natural language captions in MSRVTT, (2) the ground-truth set of facts in MSRVTT*, (3) the facts predicted by the proposed model.
Figure 5: Changes in F1 and time taken (in minutes) as the number of candidate facts evaluated varies. Left: MSVD* dataset, right: MSRVTT* dataset.
