VrdONE: One-stage Video Visual Relation Detection

Xinjie Jiang; Chenxi Zheng; Xuemiao Xu; Bangzhen Liu; Weiying Zheng; Huaidong Zhang; Shengfeng He

TL;DR

This work proposes VrdONE, a streamlined one-stage model that achieves state-of-the-art performance on the VidOR and ImageNet-VidVRD benchmarks, demonstrating its ability to detect relations across different temporal scales.

Abstract

Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relation categories are present and another for determining their temporal boundaries. This split overlooks the inherent connection between the two. To recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet efficacious one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup allows relation category identification and binary mask generation in a single pass, eliminating the need for extra steps such as proposal generation or post-processing. VrdONE facilitates the interaction of features across frames, capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, which enhances how subjects and objects perceive each other before their features are combined. VrdONE achieves state-of-the-art performance on the VidOR and ImageNet-VidVRD benchmarks, showcasing its capability in discerning relations across different temporal scales. The code is available at https://github.com/lucaspk512/vrdone.
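To make the "1D instance segmentation" framing concrete, the following is a minimal sketch of how a per-frame binary mask for one subject-object pair can encode a relation's temporal extent. It is an illustration only and is not taken from the authors' code: the function name `mask_to_segments`, the 0.5 threshold, the predicate label, and the example probabilities are all assumptions.

```python
from typing import List, Tuple


def mask_to_segments(probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame intervals where probs >= threshold.

    Hypothetical helper: it only shows how a 1D mask over frames can be read
    as temporal boundaries. The threshold value is an assumption.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for t, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = t                      # a run of active frames begins
        elif not active and start is not None:
            segments.append((start, t))    # the run ended at the previous frame
            start = None
    if start is not None:                  # the run extends to the last frame
        segments.append((start, len(probs)))
    return segments


# Made-up prediction for one subject-object pair: a predicate label plus
# per-frame mask probabilities over an 8-frame tracklet overlap.
instance = {
    "predicate": "chase",
    "mask_probs": [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.6],
}
segments = mask_to_segments(instance["mask_probs"])
print(instance["predicate"], segments)
```

Because the mask is defined per frame, the same representation covers relations of any duration, which is the property the abstract emphasizes.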

Paper Structure (20 sections, 13 equations, 7 figures, 10 tables)

Introduction
Related Work
Methods
Preliminaries
Overview
Bilateral Spatiotemporal Aggregation
One-stage Relation Detector
Training and Inference
Experiments
Comparison with State-of-the-Arts
Ablation Studies
Qualitative Results
Conclusion
Overview
Fusing Additional Features
...and 5 more sections

Figures (7)

Figure 1: Limitations of existing two-stage methods. The heuristic aggregation in classification-based methods can lead to incorrect temporal localization, causing (a) consecutive relations to be mistakenly identified as a single relation and (b) long-lasting relations to be improperly split into shorter segments. Localization-based methods also have drawbacks: (c) relations might go undetected during inference due to mismatches with the fixed-length proposals.
Figure 2: The mean duration distribution of the top 5 head and tail relations in the VidOR dataset, showing significant variation in their lengths.
Figure 3: The VrdONE pipeline processes an untrimmed video by first extracting visual and spatial change features ($f$ and $\theta$) for all entities' tracklets using a frozen pretrained video tracker. For each subject-object pair, the BSA then encapsulates bilateral awareness into the feature embeddings. Absolute positional changes ($\theta^a$) are injected into entity features, followed by $L$ SOS modules to enrich spatiotemporal interactions. After equipping these enriched embeddings with relative spatial features $\theta^{r}$, the resulting unified embeddings $e_{so}$ are further processed by the relation encoder $E_{mul}$ and directed to two synergistic decoders, $D_{msk}$ and $D_{rel}$. With the help of the generated temporal-aware features $z_{msk}$ and category-aware features $z_{cls}$, VrdONE achieves one-stage processing for both video relation classification and temporal localization.
Figure 4: Visualization of video relation detection results with open-source methods on the VidOR dataset (top) and ImageNet-VidVRD dataset (bottom). The $\surd$, $\times$, and $\bigcirc$ represent correct, incorrect, and missing detection instances, respectively.
Figure 5: Analysis of varying weight factors $\lambda_{cls}$, $\lambda_{mf}$, and $\lambda_{md}$, which are applied to (a) Cross Entropy Loss for relation classification, (b) Mask Focal Loss and (c) Dice Loss for relation localization, respectively.
...and 2 more figures
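
The Figure 5 caption names three weighted loss terms. The sketch below shows one plausible way such terms could be combined, assuming a simple weighted sum; the exact formulation, the default weights, and the focal-loss parameters `alpha` and `gamma` are assumptions, not values taken from the paper. All function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(class_probs: Sequence[float], target: int) -> float:
    """Cross-entropy for one relation instance given class probabilities."""
    return -math.log(max(class_probs[target], 1e-12))


def focal_loss(mask_probs: Sequence[float], mask_gt: Sequence[int],
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss averaged over frames (alpha, gamma are assumed)."""
    total = 0.0
    for p, g in zip(mask_probs, mask_gt):
        p_t = p if g == 1 else 1.0 - p            # probability of the true label
        a_t = alpha if g == 1 else 1.0 - alpha    # class-balancing factor
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return total / len(mask_probs)


def dice_loss(mask_probs: Sequence[float], mask_gt: Sequence[int],
              eps: float = 1.0) -> float:
    """Dice loss on a 1D temporal mask with additive smoothing eps."""
    inter = sum(p * g for p, g in zip(mask_probs, mask_gt))
    denom = sum(mask_probs) + sum(mask_gt)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def total_loss(class_probs, target, mask_probs, mask_gt,
               lambda_cls: float = 1.0, lambda_mf: float = 1.0,
               lambda_md: float = 1.0) -> float:
    """Assumed weighted sum of the three terms named in the caption."""
    return (lambda_cls * cross_entropy(class_probs, target)
            + lambda_mf * focal_loss(mask_probs, mask_gt)
            + lambda_md * dice_loss(mask_probs, mask_gt))


# Made-up example: two relation classes, a three-frame mask.
loss = total_loss([0.3, 0.7], 1, [0.8, 0.2, 0.6], [1, 0, 1])
print(round(loss, 4))
```

Varying one weight while holding the others fixed, as the figure describes, amounts to sweeping a single `lambda_*` argument here.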
