Temporally Grounding Instructional Diagrams in Unconstrained Videos

Jiahao Zhang, Frederic Z. Zhang, Cristian Rodriguez, Yizhak Ben-Shabat, Anoop Cherian, Stephen Gould

TL;DR

The insight is that composite queries carrying different content features suppress each other through self-attention, which reduces timespan overlaps in predictions, while cross-attention corrects temporal misalignment under joint content and position guidance.

Abstract

We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries such as their general mutual exclusiveness and temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, harming accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing the visual content features of the step diagrams with a fixed number of learnable positional embeddings. Our insight is that composite queries carrying different content features suppress each other through self-attention, which reduces timespan overlaps in predictions, while cross-attention corrects temporal misalignment under joint content and position guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while simultaneously grounding multiple queries.
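
The exhaustive pairing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_composite_queries`, the dictionary layout, and the toy feature values are all hypothetical, and how the model actually combines a content feature with a positional embedding (for example by addition or concatenation) is not stated here, so the sketch simply keeps the two parts side by side.

```python
from itertools import product


def build_composite_queries(content_features, positional_embeddings):
    """Pair every content feature with every positional embedding.

    Hypothetical helper: returns one record per (diagram, position) pair,
    ordered diagram-major, so N diagrams and K embeddings give N * K queries.
    """
    return [
        {"diagram": d, "position": p, "content": c, "pos_embed": e}
        for (d, c), (p, e) in product(
            enumerate(content_features), enumerate(positional_embeddings)
        )
    ]


# Toy inputs (placeholder values, not real features).
content = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 step diagrams
positions = [[1.0, 0.0], [0.0, 1.0]]            # 2 positional embeddings

queries = build_composite_queries(content, positions)
print(len(queries))  # 3 diagrams x 2 embeddings
```

Because every diagram is paired with every positional embedding, the number of queries grows as the product of the two counts, and queries that share a diagram carry identical content features while differing only in position.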

Paper Structure (11 sections, 1 equation, 4 figures, 8 tables)

This paper contains 11 sections, 1 equation, 4 figures, 8 tables.

Figures (4)

  • Figure 9: Kernel density estimate (KDE) plots of the normalized start and end time distributions for different datasets. The ground-truth time spans in YouCookII and IAW are spread more uniformly through the video than those in Charades-STA, ActivityNet Captions and TACoS. The latter datasets show strong location priors, visible as a blob in the bottom-left corner of the Charades-STA (val), ActivityNet Captions (val) and TACoS (val and test) panels, whereas in IAW and YouCookII the density is spread across the whole video sequence.
  • Figure 10: Visualization of the last-layer self-attention among composite queries for the same example shown in the cross-attention visualization in the main paper. The index on each axis denotes the corresponding composite query, e.g., 0 means composite query (1, 1) with diagram 1 and learnable query 1, 3 represents (2, 1), 7 denotes (3, 2), and so on. Composite queries 2, 4 and 7 get the highest scores at the end.
  • Figure 11: Qualitative results for two successful examples. The horizontal axis represents the timeline of the video. Each row corresponds to a step diagram, where the solid rectangle denotes the ground truth and the top-1 time span prediction is shown as a bounding box of the same color.
  • Figure 12: Qualitative results for two failed examples. The horizontal axis represents the timeline of the video. Each row corresponds to a step diagram, where the solid rectangle denotes the ground truth and the top-1 time span prediction is shown as a bounding box of the same color.
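
The index convention in the Figure 10 caption can be made concrete with a short sketch. The caption gives only three examples (0 is (1, 1), 3 is (2, 1), 7 is (3, 2)); these are consistent with a diagram-major layout using three learnable queries per diagram, but that count is inferred from the examples rather than stated, so `num_learnable=3` and the function name `composite_query_pair` are assumptions.

```python
def composite_query_pair(index, num_learnable=3):
    """Map a flat composite-query index to a 1-based (diagram, learnable query) pair.

    Assumes a diagram-major layout; num_learnable=3 is inferred from the
    caption's examples, not stated in the source.
    """
    diagram, learnable = divmod(index, num_learnable)
    return diagram + 1, learnable + 1


# The three examples given in the caption.
for idx in (0, 3, 7):
    print(idx, composite_query_pair(idx))
```

Under this reading, the highest-scoring composite queries 2, 4 and 7 map to (1, 3), (2, 2) and (3, 2), that is, one query per diagram.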