Adversarially Masked Video Consistency for Unsupervised Domain Adaptation

Xiaoyu Zhu, Junwei Liang, Po-Yao Huang, Alex Hauptmann

TL;DR

This paper proposes a transformer-based model that learns class-discriminative and domain-invariant feature representations and enforces prediction consistency between masked target videos and their full forms.

Abstract

We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first is a Generative Adversarial Domain Alignment Network, which aims to learn domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domains. The mask generator, conversely, aims to produce challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module, which learns class-discriminative representations by enforcing prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of domain adaptation methods, we construct a more challenging benchmark for egocentric videos, U-Ego4D. Our method achieves state-of-the-art performance on the Epic-Kitchen and the proposed U-Ego4D benchmarks.
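
The adversarial relationship between the encoder and the mask generator described in the abstract can be written schematically as a min-max problem. This is only a sketch under stated assumptions: the symbols $E$ (encoder), $G$ (mask generator), $d$ (a domain distance), and the choice to mask both the source video $x^{s}$ and the target video $x^{t}$ are illustrative and are not taken from the paper's own equations.

```latex
% Schematic objective (assumed notation, not the paper's exact formulation):
% the encoder E minimizes the domain distance d, the mask generator G maximizes it.
\min_{\theta_E} \; \max_{\theta_G} \;
  d\Bigl( E_{\theta_E}\bigl(\tilde{x}^{s}\bigr),\; E_{\theta_E}\bigl(\tilde{x}^{t}\bigr) \Bigr),
\qquad
\tilde{x} = G_{\theta_G}(x) \odot x
```

Here $\odot$ denotes element-wise masking of the input tokens, so $\tilde{x}$ is the masked view that the generator hands to the encoder.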

Paper Structure (28 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Visualization of the feature space for unsupervised domain adaptation methods. Existing state-of-the-art video domain adaptation models (tzeng2017adversarial, ganin2016domain, chen2019temporal) use full-view input data to perform domain alignment, as shown in (b). In this work, we propose a model that learns from adversarially masked samples, which can lead to effective domain-invariant and class-discriminative representations.
  • Figure 2: Overview of the proposed framework. There are two training stages to learn domain-invariant and class-discriminative representations. The goal of stage one (denoted by solid lines) is to align the source and target domains. As directly aligning the two domains using full views may lead to trivial solutions, we propose an adversarial mask generator to produce masked samples. This module is trained with the domain-invariant encoder in an adversarial way. In stage two (denoted by dashed lines), we propose a Masked Consistency Learning module to enhance the model's understanding of the spatio-temporal context and thus improve its class-discrimination ability. We first initialize the class-discriminative visual encoder using the weights learned in stage one. Then we force the visual encoder to make consistent predictions on the full and masked views of the same target video. Our two-stage training framework learns effective domain-invariant and class-discriminative representations that are robust to large domain gaps.
  • Figure 3: Left: Class distributions per domain for the U-Ego4D benchmark. Right: Videos collected from different regions are treated as different domains. Unlike the Epic-Kitchen dataset, which is limited to kitchen scenes, the same action in the U-Ego4D benchmark can occur in entirely different environments.
  • Figure 4: Visualizations of the Adversarially-Learned Masks.
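
The stage-two idea in the Figure 2 caption, consistent predictions on the full and masked views of the same target video, can be illustrated with a small self-contained sketch. Everything below is an assumption made for illustration: `toy_classifier` is a hypothetical stand-in for the visual encoder, the mask is a fixed binary keep-mask rather than an adversarially learned one, and KL divergence is one possible consistency measure, since this summary does not state which loss the paper uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_tokens(tokens, keep_mask):
    """Return only the tokens whose keep_mask entry is 1 (the masked view)."""
    return [tok for tok, keep in zip(tokens, keep_mask) if keep]


def toy_classifier(tokens, weights):
    """Hypothetical stand-in for encoder + head: mean-pool tokens, then a linear layer."""
    dim = len(tokens[0])
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return [sum(w * p for w, p in zip(row, pooled)) for row in weights]


def consistency_loss(full_logits, masked_logits):
    """KL(p_full || p_masked): zero when both views yield the same prediction.

    Assumed loss form; the full-view prediction acts as the target.
    """
    p_full = softmax(full_logits)
    p_masked = softmax(masked_logits)
    return sum(p * math.log(p / q) for p, q in zip(p_full, p_masked))


# Four 2-dimensional "video tokens" from one unlabeled target clip (toy data).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
keep_mask = [1, 0, 1, 1]  # fixed mask for illustration; the paper learns masks adversarially
weights = [[2.0, -1.0], [-1.0, 2.0], [0.5, 0.5]]  # 3 classes, 2 features

full_logits = toy_classifier(tokens, weights)
masked_logits = toy_classifier(mask_tokens(tokens, keep_mask), weights)
loss = consistency_loss(full_logits, masked_logits)
print(f"consistency loss between full and masked views: {loss:.4f}")
```

In a real training loop this quantity would be minimized with respect to the encoder's parameters on target videos; the toy classifier here only shows where the two views enter the loss.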