Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

Ni Wang, Dongliang Liao, Xing Xu

TL;DR

A transformer variant, the Multi-Scale Temporal Difference Transformer (MSTDT), that addresses the traditional transformer's limited ability to capture local temporal information, together with a new loss that narrows the distance between similar samples.

Abstract

Currently, in the field of video-text retrieval, there are many transformer-based methods. Most of them stack frame features, regard frames as tokens, and then use transformers for video temporal modeling. However, they commonly neglect the transformer's inferior ability to model local temporal information. To tackle this problem, we propose a transformer variant named Multi-Scale Temporal Difference Transformer (MSTDT). MSTDT mainly addresses the defect of the traditional transformer, namely its limited ability to capture local temporal information. Besides, in order to better model detailed dynamic information, we make use of the difference features between frames, which practically reflect the dynamic movement of a video. We extract the inter-frame difference features and integrate the difference and frame features with the multi-scale temporal transformer. In general, our proposed MSTDT consists of a short-term multi-scale temporal difference transformer and a long-term temporal transformer. The former focuses on modeling local temporal information; the latter aims at modeling global temporal information. Finally, we propose a new loss to narrow the distance between similar samples. Extensive experiments show that a backbone such as CLIP, equipped with MSTDT, attains a new state-of-the-art result.
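The core signal the abstract describes is the inter-frame difference feature: subtracting adjacent frame features to expose motion. A minimal sketch of how such features might be computed, assuming frame features are already stacked into an array; interpreting "multi-scale" as differences at several temporal strides is our assumption here, not the paper's exact construction:

```python
import numpy as np

def multi_scale_differences(frames, scales=(1, 2, 4)):
    """Hypothetical sketch of multi-scale inter-frame difference features.

    frames: (num_frames, dim) array of stacked per-frame features.
    For each temporal stride s, the difference frames[t] - frames[t - s]
    captures movement over an s-frame span: small strides reflect local
    dynamics, larger strides coarser motion.
    """
    feats = []
    for s in scales:
        # difference feature at stride s, shape (num_frames - s, dim)
        feats.append(frames[s:] - frames[:-s])
    return feats

# Example: 12 frames of 4-dim features
frames = np.random.randn(12, 4)
diffs = multi_scale_differences(frames)
# diffs[0].shape == (11, 4), diffs[1].shape == (10, 4), diffs[2].shape == (8, 4)
```

In MSTDT these difference features would feed the short-term branch (local temporal modeling), while the raw frame features feed the long-term branch (global temporal modeling).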


Paper Structure

This paper contains 28 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Failed cases of CLIP4Clip. CLIP4Clip can correctly model the context of the entire video, such as "cook in kitchen" and "woman asks a man", but it is not good at short-term fine-grained actions and subtle scenes, such as the actions colored with purple.
  • Figure 2: Overview of the proposed Multi-Scale Temporal Difference Transformer (MSTDT) and CLIP with Multi-Scale Temporal Difference Transformer (CLIP-MSTDT).
  • Figure 3: Examples illustrating the purpose of binary similarity loss.
  • Figure 4: Visualization of our method and CLIP4Clip on the MSR-VTT dataset.
  • Figure 5: Comparison of different settings on trade-off rate (a) and transformer layers (b).
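The abstract and Figure 3 mention a binary similarity loss that narrows the distance between similar samples. The paper's exact formulation is not given in this summary; one hedged sketch of the idea is to supervise a similarity matrix with a binary mask marking which video-text pairs count as similar (the function name and MSE form below are illustrative assumptions):

```python
import numpy as np

def binary_similarity_loss(sim, pos_mask):
    """Illustrative sketch, not the paper's exact loss: pull similarity
    scores of pairs flagged similar (pos_mask == 1) toward 1 and all
    other pairs toward 0, via mean squared error.

    sim:      (n, n) matrix of video-text similarity scores
    pos_mask: (n, n) binary matrix; 1 where the pair is considered similar
    """
    target = pos_mask.astype(float)
    return float(np.mean((sim - target) ** 2))

# Example: a perfectly separated batch incurs zero loss
sim = np.eye(3)
mask = np.eye(3, dtype=int)
# binary_similarity_loss(sim, mask) -> 0.0
```

The point of such a loss, per Figure 3, is that samples with similar semantics are explicitly drawn closer rather than relying only on the diagonal (exact-match) pairs.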