Unified and Dynamic Graph for Temporal Character Grouping in Long Videos

Xiujun Shu; Wei Wen; Liangsheng Xu; Ruizhi Qiao; Taian Guo; Hanjun Li; Bei Gan; Xiao Wang; Xing Sun

TL;DR

This paper presents a unified and dynamic graph (UniDG) framework for temporal character grouping, built around a unified representation network that learns representations of multiple modalities in the same space while preserving each modality's uniqueness.

Abstract

Video temporal character grouping locates the moments at which major characters appear in a video, grouped by identity. To this end, recent works have evolved from unsupervised clustering to graph-based supervised clustering. However, graph methods are built on the premise of fixed affinity graphs, which introduces many inexact connections. In addition, they extract multi-modal features with several different models, which complicates deployment. In this paper, we present a unified and dynamic graph (UniDG) framework for temporal character grouping. First, a unified representation network learns representations of multiple modalities within the same space while preserving each modality's uniqueness. Second, we present a dynamic graph clustering method in which a varying number of neighbors is dynamically constructed for each node via a cyclic matching strategy, leading to a more reliable affinity graph. Third, a progressive association method exploits spatial and temporal contexts among different modalities, allowing multi-modal clustering results to be well fused. As current datasets only provide pre-extracted features, we evaluate our UniDG method on a collected dataset named MTCG, which contains face and body clips and speaking voice tracks for each character. We also evaluate our key components on existing clustering and retrieval datasets to verify their generalization ability. Experimental results show that our method achieves promising results and outperforms several state-of-the-art approaches.
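To make the idea of a per-node, variable-size neighborhood concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "cyclic matching" can be approximated by a reciprocal check: node j stays a neighbor of node i only if i also appears among j's top-k candidates. The helper names (`cosine`, `top_k`, `dynamic_neighbors`), the cosine similarity, and the toy features are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(features, i, k):
    """Indices of the k nodes most similar to node i, excluding i itself."""
    scores = [(cosine(features[i], features[j]), j)
              for j in range(len(features)) if j != i]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


def dynamic_neighbors(features, k):
    """Assumed reciprocal check: keep j for node i only if i is also
    among j's top-k candidates, so neighbor counts vary per node."""
    candidates = [top_k(features, i, k) for i in range(len(features))]
    return [[j for j in candidates[i] if i in candidates[j]]
            for i in range(len(features))]


# Toy features: three similar nodes and one isolated node.
features = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
]
neighbors = dynamic_neighbors(features, k=2)
```

With a fixed k-nearest-neighbor graph, the isolated node would still be linked to two unrelated nodes. Under the reciprocal check assumed here it ends up with no neighbors, which illustrates how a dynamic neighborhood can avoid inexact connections.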

Paper Structure (21 sections, 8 equations, 12 figures, 9 tables)

Introduction
Related Work
Representation Learning
Clustering Algorithms
Methodology
Preliminaries
Unified Representation Network
Dynamic Graph Clustering
Progressive Association
The MTCG Dataset
Statistical Data
Annotation Process
Experiments
Implementation Details
Main results
...and 6 more sections

Figures (12)

Figure 1: Off-the-shelf video character-grouping methods vs. our UniDG. (a) We learn representations across modalities in the same space. (b) We construct dynamic neighbors for more reliable affinity graphs.
Figure 2: The definition of multi-modal temporal character grouping in long videos. The whole pipeline in real applications contains four stages and we focus on the stage of temporal character grouping. The long videos usually last tens of minutes to hours. By fully exploring multiple modalities, i.e., face, body, and voice, we can obtain the start and end time of appearing moments for major characters (1, 2, 3, etc.) in each video. Multiple modalities can provide complementary cues, e.g., the girl's face in the 2nd image is not visible, but her body can recall corresponding moments. The voices are used to assist us in obtaining more compact grouping results.
Figure 3: Architecture of the proposed UniDG Framework. The framework consists of three modules: Unified Representation Network, Dynamic Graph Clustering, and Progressive Association. The input consists of three modalities, i.e., face, body, and voice.
Figure 4: Graph building time at different data sizes. The experiments are performed on the large-scale face dataset WebFace42M (WebFace260M, 2021).
Figure 5: Spatial association. The spatial IoU is used to match faces and bodies. Because a face and a body belong to one specific frame, they only need to be matched within that frame, so the matching is very fast.
...and 7 more figures
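The spatial association described in the Figure 5 caption can be sketched as follows. This is an illustrative sketch, not the paper's code: the `(x1, y1, x2, y2)` box format, the greedy one-to-one assignment, the `min_iou` threshold, and the function names are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_faces_to_bodies(faces, bodies, min_iou=0.01):
    """Greedily pair faces and bodies from a single frame by descending IoU.

    Returns a dict mapping face index to body index. The threshold
    min_iou is a hypothetical value chosen for this example.
    """
    pairs = sorted(
        ((iou(f, b), fi, bi)
         for fi, f in enumerate(faces)
         for bi, b in enumerate(bodies)),
        reverse=True,
    )
    matches, used_bodies = {}, set()
    for score, fi, bi in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be matched
        if fi in matches or bi in used_bodies:
            continue  # each face and each body is used at most once
        matches[fi] = bi
        used_bodies.add(bi)
    return matches


# Two faces and two bodies detected in the same frame (toy coordinates).
faces = [(110, 20, 150, 70), (320, 30, 360, 80)]
bodies = [(300, 20, 400, 300), (90, 10, 190, 310)]
matches = match_faces_to_bodies(faces, bodies)
```

Since only the detections of one frame are compared, the number of candidate pairs stays small, which is consistent with the caption's remark that the matching is very fast.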
