TDViT: Temporal Dilated Video Transformer for Dense Video Tasks

Guanxiong Sun; Yang Hua; Guosheng Hu; Neil Robertson

TL;DR

TDViT tackles dense video tasks that require per-frame predictions by introducing Temporal Dilated Transformer Blocks (TDTB) that incorporate a memory of past frames and a temporal dilation mechanism. By stacking TDTBs in a hierarchical fashion, TDViT exponentially expands the temporal receptive field, enabling effective long-range temporal modeling with single-frame computation. The approach achieves strong accuracy and favorable speed-accuracy trade-offs on ImageNet VID and YouTube VIS, and demonstrates compatibility with state-of-the-art detection and segmentation frameworks. Overall, TDViT provides a compact, end-to-end transformer backbone for dense video tasks with practical applicability and competitive performance.
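
As a rough illustration of the memory and temporal dilation mentioned above (and in the Figure 2 caption, where the memory saves new features and deletes the oldest ones), the sketch below keeps a fixed-size first-in-first-out store of past frame features and samples it with a dilation factor. The class name `FrameMemory`, the parameters `capacity`, `dilation` and `num_samples`, and the rule of starting from the most recently stored frame are assumptions made for illustration; they are not taken from the paper or its code release.

```python
from collections import deque


class FrameMemory:
    """Hypothetical FIFO store of per-frame features (illustration only)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry automatically on append.
        self.buf = deque(maxlen=capacity)

    def push(self, feat):
        # Save the features of the frame that was just processed.
        self.buf.append(feat)

    def sample(self, dilation, num_samples):
        # Walk backwards from the newest stored frame in steps of `dilation`,
        # so neighbouring (largely redundant) frames are skipped.
        picked = []
        idx = len(self.buf) - 1
        while idx >= 0 and len(picked) < num_samples:
            picked.append(self.buf[idx])
            idx -= dilation
        return picked


mem = FrameMemory(capacity=4)
for t in range(6):          # integers stand in for frame features
    mem.push(t)             # frames 0 and 1 are evicted once capacity is hit

print(mem.sample(dilation=2, num_samples=2))  # [5, 3]
```

In a full block, the sampled features would supply the key and value tokens while the current frame supplies the queries; that attention step is omitted here.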

Abstract

Deep video models, such as 3D CNNs and video transformers, have achieved promising performance on sparse video tasks, i.e., predicting one result per video. However, challenges arise when adapting existing deep video models to dense video tasks, i.e., predicting one result per frame. Specifically, these models are expensive to deploy, less effective at handling redundant frames, and have difficulty capturing long-range temporal correlations. To overcome these issues, we propose a Temporal Dilated Video Transformer (TDViT) that consists of carefully designed temporal dilated transformer blocks (TDTB). TDTB can efficiently extract spatiotemporal representations and effectively alleviate the negative effect of temporal redundancy. Furthermore, by using hierarchical TDTBs, our approach obtains an exponentially expanded temporal receptive field and can therefore model long-range dynamics. Extensive experiments are conducted on two dense video benchmarks: ImageNet VID for video object detection and YouTube VIS for video instance segmentation. The experimental results demonstrate the superior efficiency, effectiveness, and compatibility of our method. The code is available at https://github.com/guanxiongsun/vfe.pytorch.

Paper Structure (28 sections, 1 equation, 5 figures, 9 tables)

This paper contains 28 sections, 1 equation, 5 figures, 9 tables.

Introduction
Related Work
Video Models for Sparse Video Tasks.
Dense Video Tasks.
Vision Transformers.
Method
Overall Architecture
Temporal Dilated Transformer Block
Memory Sampling and Feature Reuse.
Efficient Local Attentions.
Spatiotemporal Attention Schemes
Temporal Receptive Field of Hierarchical TDTB
Architecture and Variants
Experiments
Video Object Detection Setup.
...and 13 more sections

Figures (5)

Figure 1: Architecture (a) is widely used in sparse video tasks. 3D models, e.g., 3D CNNs, take multiple frames as input and generate one output by averaging the spatiotemporal representations. Architecture (b) is used for dense video tasks. Considering the computational cost, 2D models are used to extract features of independent frames, and then temporal modules, e.g., correlation filters, are leveraged to model spatiotemporal correspondences. Our TDViT (c) is designed for dense video tasks, which can efficiently and effectively extract spatiotemporal representations using temporal dilated transformer blocks (TDTB). Best viewed in colour.
Figure 2: (a) Overview. TDViT contains four stages, each consisting of several temporal dilated transformer blocks (TDTB). A memory structure (purple cuboids) is introduced into the TDTB, which stores features of previous frames (yellow rectangles) and enables our approach to dynamically establish temporal connections. The temporal dilation factor $D_t$ controls the memory sampling process and reduces video redundancy. (b) Details of a TDTB. At every time step, the query tokens come from the current frame $I_t$, while the key and value tokens are derived from memory sampling. Finally, the memory structure saves the output features and deletes the oldest features. Best viewed in colour.
Figure 3: Illustration of memory sampling. The red cuboid denotes the sampled features and the grey cuboid denotes stored features in memory. Best viewed in colour.
Figure 4: (a) Window and (b) correlation based local attentions. The blue rectangle denotes one query token, and the red dotted rectangle denotes the range of participating key tokens. (c) Split and (d) factorised spatiotemporal schemes. The blue and red boxes denote space-only self-attention and TDTB, respectively. Best viewed in colour.
Figure 5: Illustration of how hierarchical TDTBs increase the temporal receptive field. (a) and (b) show the framework of RDN and our TDViT, respectively. The red and blue rectangles denote the current frame and frames within the temporal receptive field, respectively. Best viewed in colour.
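
To make the "exponentially expanded temporal receptive field" concrete, the following sketch counts which past frames can influence the current output when blocks are stacked. It assumes a simplified model in which each block reads its input at the current time step plus `samples_per_block` earlier time steps spaced `d` frames apart, with one dilation `d` per block. The function name, the doubling dilation schedule, and this connectivity model are illustrative assumptions, not the configuration used in the paper.

```python
def receptive_offsets(dilations, samples_per_block=1):
    """Return the set of frame offsets (0 = current frame) reachable
    through a stack of blocks, one dilation value per block."""
    offsets = {0}
    for d in dilations:
        # Each block sees the current step and `samples_per_block`
        # earlier steps spaced `d` frames apart.
        steps = [k * d for k in range(samples_per_block + 1)]
        offsets = {o + s for o in offsets for s in steps}
    return offsets


doubling = receptive_offsets([1, 2, 4, 8])   # dilation doubles per block
constant = receptive_offsets([1, 1, 1, 1])   # no dilation growth

print(len(doubling), max(doubling))  # 16 15
print(len(constant), max(constant))  # 5 4
```

Under these assumptions, four blocks with doubling dilation reach 16 frames, whereas four blocks with a fixed dilation of 1 reach only 5, which is the kind of exponential versus linear growth the caption contrasts.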
