Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances
Arun V. Reddy, Ketul Shah, William Paul, Rohita Mocharla, Judy Hoffman, Kapil D. Katyal, Dinesh Manocha, Celso M. de Melo, Rama Chellappa
TL;DR
The paper tackles synthetic-to-real domain shift in video-based action recognition by introducing RoCoG-v2, a dataset with real and synthetic gesture videos from ground and aerial viewpoints across seven gestures. It evaluates baseline action recognition and unsupervised domain adaptation methods (notably DANN and CO2A) using two backbones (I3D and X3D) and standardized preprocessing. Findings show that synthetic data can be valuable for training, but substantial domain gaps remain, especially for ground-to-air shifts; DA methods help only in some settings, most notably for the challenging aerial viewpoints. The work highlights practical implications for robotics and offers directions for future research, such as motion realism analysis, multi-modal inputs, and viewpoint-specific adaptation techniques.
Abstract
Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds, and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes, and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts.
