What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

Sangmin Woo, Junhyug Noh, Kangil Kim

TL;DR

A novel approach named Temporal Span Proposal Network (TSPN) is proposed. It trains at least 2X faster than existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR).

Abstract

Identifying relations between objects is central to understanding a scene. While several works have addressed relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when do relations start and end?). To date, two representative approaches have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these approaches and then propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair, i.e., measuring how probable it is that a relation exists. TSPN tells when to look: it simultaneously predicts the start-end timestamps (i.e., temporal spans) and the categories of all possible relations by utilizing the full video context. These two designs enable a win-win scenario: TSPN trains at least 2X faster than existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablation experiments demonstrate the effectiveness of our approach. Code is available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.

Paper Structure

This paper contains 30 sections, 14 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Who is handing over the bear to whom? When we try to answer the question with only a single image (leftmost), the answer could be either the left or the right man. Guessing the relation from a short-term video segment (middle) is still ambiguous, but the answer becomes clear in the full video (rightmost) thanks to the spatio-temporal context. The time sequence runs from top to bottom. Answer: right man.
  • Figure 2: Conceptual comparison of typical VidVRD approaches (empirical comparisons are in Tables \ref{tab:time}, \ref{tab:vidvrd}, and \ref{tab:vidor}). (a) The segment-based approach first chunks a video into multiple segments, predicts the short-term relations within each segment, and then greedily associates the relations of adjacent segments into long-term relations. (b) The window-based approach generates a set of sub-tracklet pairs via a size-varying sliding window, and then predicts all relations with different temporal spans. (c) TSPN (ours) jointly predicts relation categories and their temporal spans from a single video-level object trajectory pair. $O_i$ stands for the $i$-th object trajectory among all object trajectories in the video, and $R_j$ denotes the $j$-th relation category. We assume that relations are predicted only for the temporal span in which the two object trajectories appear simultaneously in the video. Note that the illustration of each method may not contain all of its detailed procedures.
  • Figure 3: Overview of TSPN. (a) TSPN is built upon the object trajectory proposal head, which comprises object detection and tracking stages (Sec. \ref{sec:3b}). (b) We first extract video visual features via a CNN backbone \cite{he2016deep}. With the detection results (RoIs), we then extract RoI features \cite{xu2017r} of the subject, object, and union area. Their corresponding bounding box coordinates and class distributions are obtained directly from the object detection phase (Sec. \ref{sec:3c}). (c) The concatenation of RoI features with bounding box coordinates and class distribution ($J$) is linearly transformed and then fused via a Hadamard product (denoted as $\odot$ in the figure), resulting in $H$. The relationness $\mathcal{S}$ of an object pair is first calculated by feeding $H$ into an FC layer. After that, only the set of pairs with high relationness scores (colored in red in the figure) is considered in the subsequent process. Note that the relationness score is direction-dependent, e.g., $\mathcal{S}(O_1 \rightarrow O_2)=0.97$ versus $\mathcal{S}(O_2 \rightarrow O_1)=0.53$, since the pair-wise relationship can vary when subject and object are switched (Sec. \ref{sec:3d}). (d) Finally, joint features are concatenated and fed to another FC layer to predict the output $Z$, which is regarded as an outer product of relation labels $\mathcal{R}$ and their temporal spans $\mathcal{T}$, i.e., start-end times (Sec. \ref{sec:3e}). TSPN can be trained in an end-to-end manner. See the text for more details.
  • Figure 4: Qualitative examples of visual relation detection results. For each given video, we contrast the relation triplets (i.e., subject-relation-object) predicted by VidVRD with those predicted by TSPN. The same color indicates the same object instance. The arrows represent the time axes, providing an approximation of the temporal span of the predicted relation triplets. We highlight relations that TSPN predicted correctly while VidVRD did not. A predicted relation is considered correct only if its pair of object trajectories has sufficiently high vIoU (i.e., vIoU $>$ 0.5) with the ground truth trajectories, and only the correct relations among the top-20 predictions are shown in the figure.
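
The relationness scoring step described in the Figure 3(c) caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `relationness_scores`, `top_k_pairs`, `W_s`, `W_o`, and `w_fc` are hypothetical, the weights are random rather than learned, the sigmoid and the fixed top-k rule are assumed choices, and only subject and object features are fused (the union-area feature is omitted for brevity).

```python
import numpy as np


def relationness_scores(J, W_s, W_o, w_fc, b_fc=0.0):
    """Score every ordered (subject, object) pair of object trajectories.

    J    : (N, D) joint features, one row per trajectory (assumed layout).
    W_s  : (D, K) linear map applied when a trajectory acts as subject.
    W_o  : (D, K) linear map applied when a trajectory acts as object.
    w_fc : (K,)   weights of the scoring FC layer.
    Returns an (N, N) matrix S where S[i, j] scores the pair O_i -> O_j.
    """
    subj = J @ W_s                              # (N, K)
    obj = J @ W_o                               # (N, K)
    # Hadamard (element-wise) product for all ordered pairs via broadcasting.
    H = subj[:, None, :] * obj[None, :, :]      # (N, N, K)
    logits = H @ w_fc + b_fc                    # (N, N)
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid, an assumed choice


def top_k_pairs(S, k):
    """Return the k ordered pairs (i, j), i != j, with the highest scores."""
    masked = S.copy()
    np.fill_diagonal(masked, -np.inf)           # a trajectory is not paired with itself
    flat = np.argsort(masked, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, S.shape)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


# Toy example with random (untrained) weights.
rng = np.random.default_rng(0)
N, D, K = 4, 16, 8
J = rng.normal(size=(N, D))
W_s = 0.1 * rng.normal(size=(D, K))
W_o = 0.1 * rng.normal(size=(D, K))
w_fc = rng.normal(size=K)

S = relationness_scores(J, W_s, W_o, w_fc)
kept = top_k_pairs(S, k=3)
print("kept pairs:", kept)
```

Because the subject and object roles use different linear maps, `S[i, j]` and `S[j, i]` generally differ, which mirrors the direction-dependent scores in the caption. Only the pairs returned by `top_k_pairs` would be passed on to the relation and temporal span prediction stage, which is how the search space is sparsified.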