TDViT: Temporal Dilated Video Transformer for Dense Video Tasks
Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson
TL;DR
TDViT tackles dense video tasks, which require per-frame predictions, by introducing Temporal Dilated Transformer Blocks (TDTB) that combine a memory of past frames with a temporal dilation mechanism. Stacking TDTBs hierarchically expands the temporal receptive field exponentially, enabling long-range temporal modeling with single-frame computation. The approach achieves strong accuracy and favorable speed-accuracy trade-offs on ImageNet VID and YouTube VIS, and is compatible with state-of-the-art detection and segmentation frameworks. Overall, TDViT is a compact, end-to-end transformer backbone for dense video tasks.
Abstract
Deep video models, for example, 3D CNNs or video transformers, have achieved promising performance on sparse video tasks, i.e., predicting one result per video. However, challenges arise when adapting existing deep video models to dense video tasks, i.e., predicting one result per frame. Specifically, these models are expensive to deploy, less effective when handling redundant frames, and struggle to capture long-range temporal correlations. To overcome these issues, we propose a Temporal Dilated Video Transformer (TDViT) that consists of carefully designed temporal dilated transformer blocks (TDTB). TDTB can efficiently extract spatiotemporal representations and effectively alleviate the negative effect of temporal redundancy. Furthermore, by using hierarchical TDTBs, our approach obtains an exponentially expanded temporal receptive field and therefore can model long-range dynamics. Extensive experiments are conducted on two different dense video benchmarks, i.e., ImageNet VID for video object detection and YouTube VIS for video instance segmentation. The experimental results demonstrate the efficiency, effectiveness, and compatibility of our method. The code is available at https://github.com/guanxiongsun/vfe.pytorch.
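To make the claim about an exponentially expanded temporal receptive field concrete, the following is a minimal sketch, not the authors' implementation. It assumes each block attends to the current frame plus `memory_size` past frames of its input, sampled at a stride equal to the block's dilation, and that dilations double from block to block. The function names, the dilation schedule, and the memory size are all hypothetical choices for illustration; the abstract does not specify them.

```python
def reachable_offsets(dilations, memory_size=1):
    """Return the sorted past-frame offsets that can influence the current output.

    Assumption (illustrative only): a block with dilation d reads its input at
    offsets 0, d, 2d, ..., memory_size * d frames into the past.
    """
    offsets = {0}  # before any block, the output depends only on the current frame
    for d in dilations:
        # Each block extends every already-reachable offset by 0..memory_size steps of size d.
        offsets = {o + k * d for o in offsets for k in range(memory_size + 1)}
    return sorted(offsets)


def temporal_receptive_field(dilations, memory_size=1):
    """Span, in frames, from the oldest reachable frame to the current one (inclusive)."""
    return 1 + sum(memory_size * d for d in dilations)


if __name__ == "__main__":
    doubling = [1, 2, 4, 8]   # hypothetical schedule: dilation doubles per block
    constant = [1, 1, 1, 1]   # baseline: no dilation
    print("doubling dilations:", temporal_receptive_field(doubling), "frames")
    print("constant dilations:", temporal_receptive_field(constant), "frames")
    print("offsets reached with doubling:", reachable_offsets(doubling))
```

Under these assumptions, four blocks with doubling dilations cover a span that grows as a power of two in the number of blocks, whereas four undilated blocks grow only linearly. In both cases each block adds the same number of memory reads per frame.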
