Transferable-guided Attention Is All You Need for Video Domain Adaptation

André Sacilotti, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida

TL;DR

This work introduces a novel and effective module, named Domain Transferable-guided Attention Block (DTAB), which compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism.

Abstract

Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared to image-based UDA. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video UDA has received little attention. Our key idea is to use transformer layers as a feature encoder and to incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge across different backbones. To improve the transferability of ViT, we introduce a novel and effective module, named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by changing the self-attention mechanism to a transferability attention mechanism. Extensive experiments were conducted on the UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets, with different backbones, such as ResNet101, I3D, and STAM, to verify the effectiveness of TransferAttn compared with state-of-the-art approaches. We also demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both the video and image domains. Our code is available at https://github.com/Andre-Sacilotti/transferattn-project-code.

Paper Structure (22 sections, 6 equations, 5 figures, 6 tables)


Figures (5)

  • Figure 1: Main intuition behind our method. In this toy example, some frames of the Jumping class carry a more meaningful representation of the action, such as the pose just before the jump begins. Because they capture the key steps of the action, such frames can also exhibit a smaller domain shift than the rest of the video.
  • Figure 2: A baseline architecture trained with an adversarial loss $\mathcal{L}_{adv}$ and a classification loss $\mathcal{L}_{cls}$. The backbone $G_b$ extracts frame-level features and the encoder $G_e$ learns meaningful semantic spatio-temporal representations.
  • Figure 3: Overview of our TransferAttn. The video frames are fed into a fixed backbone to extract frame-by-frame features, followed by a clip embedding to map frames into tokens. The embeddings are fed into a sequence of transformers to extract transferable spatio-temporal information. The adaptation branch for adversarial domain discrimination uses fine-grained representations from the transformer encoder.
  • Figure 4: DTAB overview. (a) DTAB follows a standard transformer block, except for our novel MDTA mechanism and the layer-wise IB calculation. (b) Heatmap of transferable-attention weights, showing how MDTA focuses on frames that are more transferable between domains and that also carry more meaningful information about the action. (c) Comparison of temporal attention between domains.
  • Figure 5: Ablation study on Kinetics $\rightarrow$ NEC-Drone, integrating each component of DTAB separately and comparing against a standard transformer. Left: t-SNE plots of class-wise features. Right: accuracy of each component.
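
The idea in the Figure 4 caption, attention that favors frames that transfer well between domains, can be illustrated with a small sketch. This is not the authors' MDTA implementation, which this page does not specify. It assumes, purely for illustration, that each frame's transferability is the entropy of a domain discriminator's prediction (a frame the discriminator cannot place is treated as more transferable) and that ordinary softmax attention is rescaled by that score. The function names and the `domain_probs` input are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli(p) prediction; 1.0 means maximally uncertain."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transferability_guided_attention(queries, keys, values, domain_probs):
    """Illustrative sketch (assumption, not the paper's MDTA).

    queries, keys, values: lists of T frame vectors (lists of floats).
    domain_probs: T hypothetical discriminator outputs in (0, 1),
                  read as P(frame comes from the source domain).
    Returns (outputs, weights), where weights[i][j] is how much
    frame i attends to frame j after transferability rescaling.
    """
    dim = len(keys[0])
    # Assumed transferability score: high when the domain is ambiguous.
    transfer = [binary_entropy(p) for p in domain_probs]

    outputs, weights = [], []
    for q in queries:
        # Standard scaled dot-product scores against every frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        attn = softmax(scores)
        # Rescale by transferability, then renormalize to sum to 1.
        scaled = [a * t for a, t in zip(attn, transfer)]
        norm = sum(scaled)
        row = [s / norm for s in scaled]
        weights.append(row)
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Three identical frames, so plain attention would be uniform; the middle
# frame is confidently classified as source-domain (low transferability).
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
outputs, weights = transferability_guided_attention(
    frames, frames, frames, domain_probs=[0.5, 0.99, 0.5]
)
```

In this toy setup the middle frame receives less attention than its neighbors even though its content is identical, which is the qualitative behavior the heatmap in Figure 4(b) is described as showing. Renormalizing after the rescaling is a design choice of this sketch that keeps each attention row a valid distribution.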