Long-term Pre-training for Temporal Action Detection with Transformers

Jihwan Kim, Miso Lee, Jae-Pil Heo

TL;DR

This paper identifies two crucial problems caused by data scarcity in temporal action detection (TAD), attention collapse and imbalanced performance, and proposes Long-Term Pre-training (LTP), a pre-training strategy tailored for transformers that significantly relieves these data scarcity issues.

Abstract

Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Recently, DETR-based models for TAD have been prevailing thanks to their unique benefits. However, transformers demand a huge dataset, and unfortunately data scarcity in TAD causes a severe degeneration. In this paper, we identify two crucial problems from data scarcity: attention collapse and imbalanced performance. To this end, we propose a new pre-training strategy, Long-Term Pre-training (LTP), tailored for transformers. LTP has two main components: 1) class-wise synthesis, 2) long-term pretext tasks. Firstly, we synthesize long-form video features by merging video snippets of a target class and non-target classes. They are analogous to untrimmed data used in TAD, despite being created from trimmed data. In addition, we devise two types of long-term pretext tasks to learn long-term dependency. They impose long-term conditions such as finding second-to-fourth or short-duration actions. Our extensive experiments show state-of-the-art performances in DETR-based methods on ActivityNet-v1.3 and THUMOS14 by a large margin. Moreover, we demonstrate that LTP significantly relieves the data scarcity issues in TAD.
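
The class-wise synthesis step described in the abstract can be illustrated with a minimal sketch. It assumes that each trimmed clip is already a feature array of shape (T, D) and that clips are merged by simple concatenation in a uniformly random order. The function name `synthesize_long_form`, the ordering rule, and the toy feature sizes are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def synthesize_long_form(target_clips, other_clips, rng):
    """Merge trimmed-clip features into one long, untrimmed-like sequence.

    target_clips: list of (T_i, D) arrays from the target class.
    other_clips:  list of (T_j, D) arrays from non-target classes.
    Returns (features, segments): features has shape (sum of T, D); segments
    lists the [start, end) snippet indices of each target-class clip,
    in temporal order.
    """
    clips = [(c, True) for c in target_clips] + [(c, False) for c in other_clips]
    # Assumption: a uniformly random ordering of all clips.
    order = rng.permutation(len(clips))

    pieces, segments, cursor = [], [], 0
    for idx in order:
        clip, is_target = clips[idx]
        pieces.append(clip)
        if is_target:
            # Only target-class clips become localization targets.
            segments.append((cursor, cursor + len(clip)))
        cursor += len(clip)
    return np.concatenate(pieces, axis=0), segments


rng = np.random.default_rng(0)
dim = 8  # toy feature dimension
target = [rng.normal(size=(n, dim)) for n in (5, 12, 7)]
others = [rng.normal(size=(n, dim)) for n in (9, 4, 6, 10)]
features, segments = synthesize_long_form(target, others, rng)
```

The returned `segments` play the role of ground-truth action boundaries for the target class, so a detector can be pre-trained on class-wise localization even though every input clip was trimmed.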

Paper Structure (16 sections, 7 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Problems from data scarcity. Data scarcity causes two main problems in DETR for TAD: attention collapse and imbalanced performance. The first row shows the collapsed self-attention maps from the encoder and decoder of DAB-DETR. The second row depicts imbalanced performance with respect to action length (coverage) and the number of instances on ActivityNet-v1.3, from DETAD.
  • Figure 2: Differences between previous pre-training and ours. Previous pre-training focused on the feature extractor. However, no research has addressed pre-training DETR for TAD, despite the data scarcity issues. The pretext tasks for training the feature extractor and the detector should be different. LTP is designed for class-wise localization from long-form videos, just like the downstream task.
  • Figure 3: Overall procedure of Long-Term Pre-training (LTP). LTP has two main components: 1) class-wise synthesis, 2) long-term pretext tasks. Class-wise synthesis aims to minimize the task discrepancy by building training features that require localization based on categories. Moreover, the conditional tasks aim to learn long-term dependency through ordinal or scale conditions.
  • Figure 4: Attention maps. It shows self-attention maps from the last layers of the DAB-DETR encoder ((a), (c)) and decoder ((b), (d)) in test samples of ActivityNet-v1.3.
  • Figure 5: Diversity of self-attention maps. To analyze the effect of our pre-training on attention collapse, we measure the diversity of the self-attention maps, as defined in the paper's diversity equation.
  • ...and 3 more figures
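
The ordinal and scale conditions mentioned in the Figure 3 caption (and the abstract's examples of "second-to-fourth" and "short-duration" actions) can be sketched as filters over ground-truth segments. This is only an illustration under stated assumptions: segments are [start, end) snippet indices, and the helper names `ordinal_targets` and `scale_targets` and the `max_length` threshold are hypothetical rather than taken from the paper.

```python
def ordinal_targets(segments, first=2, last=4):
    """Keep the first-th through last-th actions (1-indexed) in temporal order."""
    ordered = sorted(segments)
    return ordered[first - 1:last]


def scale_targets(segments, max_length):
    """Keep only actions that span at most max_length snippets."""
    return [(s, e) for s, e in segments if e - s <= max_length]


# Toy ground-truth segments for one synthesized long-form sequence.
segments = [(0, 5), (9, 21), (25, 32), (40, 43), (50, 60)]

second_to_fourth = ordinal_targets(segments)        # ordinal condition
short_only = scale_targets(segments, max_length=6)  # scale condition
```

In both cases the model can only decide which actions to report by looking at the whole sequence (counting earlier actions or comparing durations), which is how such conditions encourage learning long-term dependency.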