Human-Centric Transformer for Domain Adaptive Action Recognition

Kun-Yu Lin; Jiaming Zhou; Wei-Shi Zheng

TL;DR

The paper tackles domain adaptive action recognition by addressing a key shortcoming of prior domain-invariant methods: the loss of human-centric cues. It introduces HCTransformer, a decoupled architecture with a human encoder, a context encoder (guided by action-related context prototypes), and a human-context decoder to model domain-invariant human-context interactions. Through three levels of alignment (human, context, and human-context) and explicit temporal modeling, HCTransformer achieves state-of-the-art results on UCF-HMDB, Kinetics-NecDrone, and EPIC-Kitchens-UDA, while Grad-CAM analyses show a stronger focus on humans. The work demonstrates that concentrating on human-centric cues enhances transferability in action recognition and offers a versatile framework for both human-centric and hand-centric domains.

Abstract

We study the domain adaptation task for action recognition, namely domain adaptive action recognition, which aims to effectively transfer action recognition power from a label-sufficient source domain to a label-free target domain. Since actions are performed by humans, it is crucial to exploit human cues in videos when recognizing actions across domains. However, existing methods tend to lose human cues and instead exploit the correlation between non-human contexts and the associated actions; when the contexts they attend to are agnostic to the actions, recognition performance in the target domain drops. To overcome this problem, we focus on uncovering human-centric action cues for domain adaptive action recognition, and we investigate two aspects of these cues, namely human cues and human-context interaction cues. Accordingly, our proposed Human-Centric Transformer (HCTransformer) develops a decoupled human-centric learning paradigm to explicitly concentrate on human-centric action cues in domain-variant video feature learning. Our HCTransformer first conducts human-aware temporal modeling with a human encoder, aiming to avoid a loss of human cues during domain-invariant video feature learning. Then, using a Transformer-like architecture, HCTransformer exploits domain-invariant and action-correlated contexts with a context encoder, and further models the domain-invariant interaction between humans and action-correlated contexts. We conduct extensive experiments on three benchmarks, namely UCF-HMDB, Kinetics-NecDrone and EPIC-Kitchens-UDA, and the state-of-the-art performance demonstrates the effectiveness of our proposed HCTransformer.

Paper Structure (27 sections, 11 equations, 11 figures, 6 tables)

Introduction
Related Works
Human Activity Understanding
Domain Adaptive Action Recognition
General Domain Adaptation
Human-Centric Transformer
Problem Formulation
Decoupled Human-Centric Learning Paradigm
Human-Aware Temporal Modeling
Action-Correlated Temporal Modeling for Contexts
Human-Context Interaction Modeling
Experiments
Experimental Setups
Benchmarks
Training and test protocols
...and 12 more sections

Figures (11)

Figure 1: Top: An illustration of domain adaptive action recognition. We zoom in on example videos of the action "shoot ball" for clarity. As shown in the figure, videos of the same action look very different in the source and target domains, due to environment change, viewpoint diversity, illumination shift, etc. Bottom: The performance of an action recognition model trained in the UCF domain drops significantly (24.3% in terms of accuracy) when tested in the HMDB domain. Best viewed in color.
Figure 2: Existing methods based on exploring domain invariance (e.g., TA3N; Chen et al., 2019) are prone to losing focus on human cues in videos. We show Grad-CAM (Selvaraju et al., 2020) examples of the problem. As shown by the heatmaps, TA3N focuses on non-human context cues in videos. In some cases, the contexts of interest are agnostic to the actions being performed, e.g., the floor of the court in the "fencing" video, which results in recognition errors. In contrast to TA3N, our proposed HCTransformer focuses on human-centric action cues closely related to the actions being performed, e.g., the fencer's body parts in the "fencing" video. Videos in the figure are from the target domain of UCF-HMDB. Best viewed in color. Please refer to Figure 4 for the Grad-CAM visualization of more existing domain adaptive action recognition methods.
Figure 3: An overview of the proposed Human-Centric Transformer (HCTransformer). We use a five-clip video for demonstration (i.e., $M=5$). HCTransformer mainly consists of three components, namely human encoder, context encoder and human-context decoder, aiming at learning human-centric action cues and aligning feature distributions at different levels. The human encoder focuses on temporal modeling for human cues, where feature alignment is conducted at each temporal granularity respectively. By introducing context prototypes as extra tokens, the context encoder exploits domain-invariant and action-correlated contexts in non-human parts of videos using self-attention and cross-attention modules. By taking the outputs of the two encoders as inputs, the human-context decoder further models the interaction between humans and action-correlated contexts using cross-attention modules. Best viewed in color.
Figure 4: Grad-CAM visualization of more existing domain adaptive action recognition methods beyond TA3N (Chen et al., 2019; see Figure 2), namely ABG (Luo et al., 2020), ACAN (Xu et al., 2022), SAVA (Choi et al., 2020) and CoMix (Sahoo et al., 2021). For a clearer comparison, we also show the visualization results of our HCTransformer. The results demonstrate that existing methods are prone to losing human cues in videos, whereas our proposed HCTransformer focuses on human-centric action cues closely related to the actions being performed. Videos in the figure are from the target domain of UCF-HMDB. Best viewed in color.
Figure 5: Quantitative analysis of the relative amount of static and dynamic information encoded by different domain adaptive action recognition models on UCF$\to$HMDB. Following Kowal et al. (2022), we use the unit-wise metric to quantify how many channels encode static, dynamic, or joint information ("joint" means that both static and dynamic information are encoded). For all models, we use I3D as the backbone and analyze the features before the classifier. We empirically find that each domain adaptive action recognition model has two types of channels, namely static and joint ones. Best viewed in color.
...and 6 more figures
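
The data flow described in the Figure 3 caption (human encoder, context encoder with prototype tokens, human-context decoder) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes single-head attention without learned projections, a context encoder reduced to one self-attention pass over clip tokens concatenated with prototype tokens, and no alignment losses. The names `attend`, `hc_forward`, and `random_tokens` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention with no learned projections.

    Each query attends over `memory`, which serves as both keys and values.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, memory)) for j in range(dim)])
    return outputs


def hc_forward(human_clips, context_clips, prototypes):
    """Hypothetical forward pass mirroring the three components in Figure 3."""
    # Human encoder (assumed form): temporal self-attention over the M human clip features.
    human_tokens = attend(human_clips, human_clips)
    # Context encoder (assumed form): prototypes join the context clips as extra tokens,
    # then only the clip positions are kept as the encoded context.
    tokens = context_clips + prototypes
    context_tokens = attend(tokens, tokens)[: len(context_clips)]
    # Human-context decoder (assumed form): human tokens query the encoded context
    # through cross-attention.
    interaction = attend(human_tokens, context_tokens)
    return human_tokens, context_tokens, interaction


def random_tokens(count, dim, rng):
    """Random stand-ins for clip or prototype features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
M, K, DIM = 5, 3, 4  # M=5 clips as in Figure 3; K and DIM are arbitrary choices
human, context, interaction = hc_forward(
    random_tokens(M, DIM, rng), random_tokens(M, DIM, rng), random_tokens(K, DIM, rng)
)
print(len(interaction), len(interaction[0]))  # prints: 5 4
```

In this sketch each of the three outputs would be the natural place to attach a domain alignment objective (human, context, and human-context levels), but the form of that objective is not specified in the text above and is therefore left out.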
