SADA: Semantic adversarial unsupervised domain adaptation for Temporal Action Localization

David Pujol-Perich, Albert Clapés, Sergio Escalera

TL;DR

SADA introduces the first unsupervised domain adaptation method tailored for sparse temporal action localization by enforcing semantic, per-class alignment across source and target domains. It couples a multi-resolution, anchor-based TAL backbone with a novel local-class and background-aware adversarial loss, facilitated by pseudo-labels for the unlabeled domain. The approach yields robust cross-domain transfer, outperforming fully supervised baselines and existing UDA methods across seven realistic domain-shift benchmarks derived from EpicKitchens100 and CharadesEgo, with gains up to 6.14% mAP. This work provides a practical, scalable framework for TAL in real-world, domain-heterogeneous video settings and introduces comprehensive benchmarks to evaluate UDA in sparse TAL.

Abstract

Temporal Action Localization (TAL) is a complex task that poses significant challenges, particularly when attempting to generalize to new, unseen domains in real-world applications. These scenarios, despite being realistic, are often neglected in the literature, exposing existing solutions to substantial performance degradation. In this work, we tackle this issue by introducing, for the first time, an approach for Unsupervised Domain Adaptation (UDA) in sparse TAL, which we refer to as Semantic Adversarial unsupervised Domain Adaptation (SADA). Our contributions are threefold: (1) we pioneer the development of a domain adaptation model that operates on realistic sparse action detection benchmarks; (2) we tackle the limitations of global-distribution alignment techniques by introducing a novel adversarial loss that is sensitive to local class distributions, ensuring finer-grained adaptation; and (3) we present a novel set of benchmarks based on EpicKitchens100 and CharadesEgo that evaluate multiple domain shifts in a comprehensive manner. Our experiments indicate that SADA improves the adaptation across domains when compared to fully supervised state-of-the-art and alternative UDA methods, attaining a performance boost of up to 6.14% mAP.

Paper Structure (31 sections, 21 equations, 15 figures, 18 tables)

This paper contains 31 sections, 21 equations, 15 figures, 18 tables.

Figures (15)

  • Figure 1: Illustration of the differences between the two most similar domain-adaptation methods, ganin2016domain and xie2018learning, and our proposal, SADA. For this, we present a simple scenario with various anchor embeddings of different actions (identified by shapes) and domains (identified by colors). In this scenario, ganin2016domain (upper row) aligns embeddings in a class-agnostic manner, making it liable to align embeddings from different domains whose action labels do not match. xie2018learning (middle row) computes class-wise mean centroids and aligns them across domains but, as shown, minimizing their distance does not yield a proper adaptation. SADA (last row) improves on ganin2016domain by aligning class-wise distributions, which yields the correct alignment because anchors with mismatched labels are never aligned.
  • Figure 2: Overview of the main model architecture of SADA. The model takes as input videos from a source and a target domain, both of which are fed to a shared multi-resolution feature extractor pyramid. The output embeddings of the two domains are then aligned using the semantic alignment loss, SADA. This is done with a level-wise and class-wise domain discriminator applied to embeddings that are filtered using ground-truth (GT) information for the source domain and pseudo-labels for the target domain. Finally, the resulting domain-invariant representations of the source domain are used to train a classification and localization head to learn the underlying task.
  • Figure 3: Overview of the 6 proposed experimental setups for EpicKitchens100. Concretely, S1 and S2 evaluate the videos from the original EK55. They define the dark-counter and white-counter kitchens as Source, respectively, and the rest as Target. S3 and S4 are similar except that they consider only the newest videos from EK100. S5 and S6 use the old videos as Source and the new videos as Target, and vice versa.
  • Figure 4: Visualization of the segments predicted by our method and by the chosen set of source-only (SO) baselines. The ground-truth (GT) segments are shown on top as a reference.
  • Figure 5: t-SNE plots of the source-only (SO) variation of our model (top row) and our proposed domain adaptation model (DA) (bottom row). The first three columns show the t-SNE plots of the source (red) and target (blue) domain anchors for action classes 1 to 3. The last column shows the plot of the background anchors, i.e., those not assigned to any GT label.
  • ...and 10 more figures
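
The class-wise alignment described in the captions of Figures 1 and 2 can be illustrated with a small sketch. This is not the authors' implementation: it only shows the general idea of bucketing anchor embeddings by class (ground-truth labels for the source, pseudo-labels for the target, background included) and scoring each bucket with its own domain discriminator. All names here (`BACKGROUND`, `group_by_class`, `classwise_domain_loss`, the linear discriminator) are hypothetical, and the per-level discriminators of the feature pyramid are omitted for brevity.

```python
import math
from collections import defaultdict

BACKGROUND = -1  # hypothetical label for anchors not matched to any action


def group_by_class(embeddings, labels):
    """Bucket anchor embeddings by class label (background included)."""
    groups = defaultdict(list)
    for emb, lab in zip(embeddings, labels):
        groups[lab].append(emb)
    return groups


def domain_logit(weights, bias, emb):
    """Toy linear domain discriminator: a positive logit means 'source'."""
    return sum(w * x for w, x in zip(weights, emb)) + bias


def bce(logit, target):
    """Numerically stable binary cross-entropy computed from a logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def classwise_domain_loss(src_emb, src_labels, tgt_emb, tgt_pseudo, discriminators):
    """Mean per-class discriminator loss over classes present in both domains.

    src_labels are ground-truth labels; tgt_pseudo are pseudo-labels.
    discriminators maps a class label to (weights, bias), one per class.
    """
    src_groups = group_by_class(src_emb, src_labels)
    tgt_groups = group_by_class(tgt_emb, tgt_pseudo)
    per_class = {}
    # Only compare anchors that share a class, so mismatched classes never align.
    for cls in src_groups.keys() & tgt_groups.keys():
        weights, bias = discriminators[cls]
        losses = [bce(domain_logit(weights, bias, e), 1.0) for e in src_groups[cls]]
        losses += [bce(domain_logit(weights, bias, e), 0.0) for e in tgt_groups[cls]]
        per_class[cls] = sum(losses) / len(losses)
    total = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return total, per_class
```

In an adversarial setup, the discriminators would be trained to minimize this loss while the shared feature extractor is trained to maximize it, for example through a gradient reversal layer as in ganin2016domain. Whether SADA uses exactly that mechanism is an assumption here, not something stated in the text above.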