Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances

Arun V. Reddy; Ketul Shah; William Paul; Rohita Mocharla; Judy Hoffman; Kapil D. Katyal; Dinesh Manocha; Celso M. de Melo; Rama Chellappa

TL;DR

The paper tackles synthetic-to-real domain shift in video-based action recognition by introducing RoCoG-v2, a dataset with real and synthetic gesture videos from ground and aerial viewpoints across seven gestures. It evaluates baseline action recognition and unsupervised domain adaptation methods (notably DANN and CO2A) using two backbones (I3D and X3D) and standardized preprocessing. Findings show that synthetic data can be valuable for training, but substantial domain gaps remain, especially for ground-to-air shifts; DA methods offer conditional improvements, particularly for challenging aerial viewpoints. The work highlights practical implications for robotics and offers directions for future research, such as motion realism analysis, multi-modal inputs, and viewpoint-specific adaptation techniques.
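As a rough illustration of what DANN does, the sketch below shows the gradient reversal idea in plain Python: features pass through unchanged in the forward pass, while the gradient flowing back from the domain classifier is negated and scaled by a weight. The ramp-up schedule for that weight follows the original DANN formulation (with gamma = 10); whether the baselines in this paper use the same schedule is an assumption, and the function names are hypothetical.

```python
import math


def dann_lambda(progress, gamma=10.0):
    """Adversarial weight from the original DANN schedule.

    `progress` runs from 0.0 (start of training) to 1.0 (end).
    The weight starts at 0 and approaches 1. Using this schedule
    for the paper's baselines is an assumption.
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


def grl_forward(features):
    """Gradient reversal acts as the identity in the forward pass."""
    return list(features)


def grl_backward(upstream_grad, lam):
    """In the backward pass the incoming gradient is negated and scaled by lam.

    This pushes the feature extractor toward features that the domain
    classifier cannot separate (e.g. synthetic vs. real).
    """
    return [-lam * g for g in upstream_grad]
```

In a real training loop this logic would sit between the video backbone and the domain classifier inside an autograd framework; the list-based version here only makes the sign flip explicit.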

Abstract

Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes, and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts.

Paper Structure (17 sections, 3 figures, 3 tables)

Introduction
Related Work
Synthetic Datasets
Synthetic-to-Real Transfer
Dataset
Baseline Experiments
Algorithms
Experimental Setup
Results
Discussion
Ground (Synthetic) $\rightarrow$ Ground (Real)
Air (Synthetic) $\rightarrow$ Air (Real)
Ground (Real) $\rightarrow$ Air (Real)
Ground (Synthetic) $\rightarrow$ Air (Real)
Class Confusion Analysis
...and 2 more sections

Figures (3)

Figure 1: (a) Many existing domain adaptation datasets, including VisDA [visda2017], focus on image classification or semantic segmentation. (b) In this paper, we investigate domain adaptation techniques across two domain shifts, synthetic-to-real and ground-to-air, focused on human action recognition.
Figure 2: The dataset consists of real and synthetic videos across the seven gesture classes, from both ground and air perspectives.
Figure 3: Confusion matrices for action recognition in the four UDA settings using CO2A with X3D backbone (averaged across three runs). Top Left: $G_S \rightarrow G_R$, Top Right: $A_S \rightarrow A_R$, Bottom Left: $G_R \rightarrow A_R$, Bottom Right: $G_S \rightarrow A_R$.
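
One plausible way to produce confusion matrices averaged across runs, as in Figure 3, is sketched below. It builds a count matrix per run, normalizes each row by the number of true samples in that class, and then averages the normalized matrices element-wise. The caption does not say whether counts or normalized rates were averaged, so that choice, along with all function names and the integer label encoding, is an assumption.

```python
NUM_CLASSES = 7  # RoCoG-v2 has seven gesture classes


def confusion_matrix(true_labels, pred_labels, num_classes=NUM_CLASSES):
    """Count matrix: rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        matrix[t][p] += 1
    return matrix


def row_normalize(matrix):
    """Turn counts into per-class rates; rows with no samples stay at zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def average_runs(matrices):
    """Element-wise mean of same-sized matrices, one per training run."""
    n = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / n for j in range(size)]
        for i in range(size)
    ]
```

For three runs, this would be called as `average_runs([row_normalize(confusion_matrix(y, p)) for y, p in runs])`, where `runs` is a hypothetical list of (true labels, predicted labels) pairs.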
