GTAutoAct: An Automatic Datasets Generation Framework Based on Game Engine Redevelopment for Action Recognition
Xingyu Song, Zhan Li, Shi Chen, Kazuyuki Demachi
TL;DR
GTAutoAct introduces a game-engine-driven framework that automatically generates large-scale, multimodal action recognition datasets with high visual quality and diverse viewpoints. It converts real actions into a rotation-based 3D motion representation, enabling robust cross-view synthesis, and employs dynamic skeletal interpolation to produce smooth, varied animations. The pipeline includes scene customization via FiveM and an autonomous collection process with random camera trajectories and hierarchical annotation, significantly reducing manual labeling. Experimental results on the NTU and H36M benchmarks demonstrate that GTAutoAct-generated data can rival or exceed real-data baselines, especially under limited-frame scenarios, highlighting its potential to bridge the synthetic-real domain gap and improve action-recognition training efficiency.
Abstract
Current datasets for action recognition tasks face limitations stemming from traditional collection and generation methods, including a constrained range of action classes, the absence of multi-viewpoint recordings, limited diversity, poor video quality, and labor-intensive manual collection. To address these challenges, we introduce GTAutoAct, an innovative dataset generation framework leveraging game engine technology to facilitate advancements in action recognition. GTAutoAct excels in automatically creating large-scale, well-annotated datasets with extensive action classes and superior video quality. Our framework's distinctive contributions encompass: (1) it innovatively transforms readily available coordinate-based 3D human motion into a rotation-oriented representation that is better suited to multiple viewpoints; (2) it employs dynamic segmentation and interpolation of rotation sequences to create smooth and realistic animations of actions; (3) it offers extensively customizable animation scenes; (4) it implements an autonomous video capture and processing pipeline, featuring a randomly navigating camera, with auto-trimming and labeling functionalities. Experimental results underscore the framework's robustness and highlight its potential to significantly improve action recognition model training.
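Contributions (1) and (2) can be illustrated with a minimal sketch. It is not the GTAutoAct implementation: the abstract does not state which rotation parameterization or interpolation scheme the framework uses, so unit quaternions and spherical linear interpolation (slerp) are assumptions here, and the function names `quat_between`, `slerp`, and `rotate` are hypothetical. The sketch turns a bone's rest direction and its observed direction (both derived from joint coordinates) into a rotation, then interpolates between two rotations to obtain an in-between frame.

```python
import numpy as np

def quat_between(u, v):
    """Unit quaternion (w, x, y, z) that rotates direction u onto direction v."""
    u = np.asarray(u, dtype=float) / np.linalg.norm(u)
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    w = 1.0 + float(np.dot(u, v))
    if w < 1e-8:
        # u and v are opposite: rotate 180 degrees about any axis perpendicular to u.
        helper = np.array([1.0, 0.0, 0.0]) if abs(u[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        axis = np.cross(u, helper)
        axis /= np.linalg.norm(axis)
        return np.array([0.0, *axis])
    q = np.array([w, *np.cross(u, v)])
    return q / np.linalg.norm(q)

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions, t in [0, 1]."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to take the shorter arc.
        q1, dot = -q1, -dot
    if dot > 0.9995:
        # Nearly identical rotations: fall back to normalized linear interpolation.
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def rotate(q, p):
    """Apply unit quaternion q = (w, x, y, z) to a 3D vector p."""
    w, qv = q[0], np.asarray(q[1:], dtype=float)
    p = np.asarray(p, dtype=float)
    t = 2.0 * np.cross(qv, p)
    return p + w * t + np.cross(qv, t)

# Hypothetical bone: rest pose along +x, observed pose along +y.
rest_dir = np.array([1.0, 0.0, 0.0])
observed_dir = np.array([0.0, 1.0, 0.0])
q_key = quat_between(rest_dir, observed_dir)      # rotation for the key frame
q_identity = np.array([1.0, 0.0, 0.0, 0.0])       # rotation for the rest frame
q_mid = slerp(q_identity, q_key, 0.5)             # interpolated in-between frame
mid_dir = rotate(q_mid, rest_dir)                 # bone direction at the in-between frame
```

Because the rotation depends only on the relative orientation of the bone and not on where the camera sits, a representation of this kind can be replayed on a game-engine skeleton and recorded from arbitrary viewpoints, which is the property the abstract attributes to the rotation-oriented form.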
