
GTAutoAct: An Automatic Datasets Generation Framework Based on Game Engine Redevelopment for Action Recognition

Xingyu Song, Zhan Li, Shi Chen, Kazuyuki Demachi

TL;DR

GTAutoAct introduces a game-engine–driven framework to auto-generate large-scale, multimodal action recognition datasets with high visual quality and diverse viewpoints. It converts real actions into a rotation-based 3D motion representation, enabling robust cross-view synthesis, and employs dynamic skeletal interpolation to produce smooth, varied animations. The pipeline includes scene customization via FiveM and an autonomous auto-collection process with random camera trajectories and hierarchical annotation, significantly reducing manual labeling. Experimental results on NTU and H36M benchmarks demonstrate that GTAutoAct-generated data can rival or exceed real-data baselines, especially under limited-frame scenarios, highlighting its potential to bridge the synthetic-real domain gap and improve action-recognition training efficiency.

Abstract

Current datasets for action recognition tasks face limitations stemming from traditional collection and generation methods, including a constrained range of action classes, absence of multi-viewpoint recordings, limited diversity, poor video quality, and labor-intensive manual collection. To address these challenges, we introduce GTAutoAct, an innovative dataset generation framework leveraging game engine technology to facilitate advancements in action recognition. GTAutoAct excels in automatically creating large-scale, well-annotated datasets with extensive action classes and superior video quality. Our framework's distinctive contributions encompass: (1) it innovatively transforms readily available coordinate-based 3D human motion into a rotation-orientated representation that is better suited to multiple viewpoints; (2) it employs dynamic segmentation and interpolation of rotation sequences to create smooth and realistic animations of actions; (3) it offers extensively customizable animation scenes; (4) it implements an autonomous video capture and processing pipeline, featuring a randomly navigating camera, with auto-trimming and labeling functionalities. Experimental results underscore the framework's robustness and highlight its potential to significantly improve action recognition model training.

Paper Structure

This paper contains 39 sections, 38 equations, 15 figures, 8 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Overview of GTAutoAct.
  • Figure 2: Configuration of the 53 bone joints in the human motion representation system of GTAutoAct. The red sign labeled "R" denotes the root bone joint. Gray signs labeled "S" denote the static bone joints. Orange, green, and blue signs labeled "3", "2", and "1" correspond to three-, two-, and one-dimensional bone joints, respectively. Violet arrows denote the hierarchical inheritance directions from parent joint to child joint.
  • Figure 3: Schematic diagram of the Euler angle calculation. $\boldsymbol{p}_{shoulder}$ (the rotation center), $\boldsymbol{p}_{elbow}$, and $\boldsymbol{p}_{wrist}$ are the nodes corresponding to the shoulder, elbow, and wrist joints in the 3D World Coordinate System. Starting from the initial target vector $\vec{v}_{init}$ and the initial reference vector $\vec{r}_{init}$, rotating them sequentially around the $x$, $y$, and $z$ axes by angles $\alpha$, $\beta$, and $\gamma$ yields $(\vec{v}_x,\vec{r}_x)$, then $(\vec{v}_y,\vec{r}_y)$, and finally $(\vec{v},\vec{r})$. The gradient arrow in the diagram indicates the Euler angle calculation order, which is the reverse of the rotation sequence. Accordingly, $\gamma$ is calculated as the rotation angle from the positive $x$-axis to $\vec{v}_{xOy}$, the projection of $\vec{v}$ onto the $xOy$ plane. Likewise, $\beta$ is the rotation angle between $\vec{v}$ and the $xOy$ plane. Lastly, $\alpha$ is the angle from the positive $y$-axis to $\vec{r}_{x_{yOz}}$, the projection of $\vec{r}_{x}$ onto the $yOz$ plane.
  • Figure 4: Examples of customized scenes in FiveM.
  • Figure 5: Schematic diagram of RCM. The camera-shaped icons are the camera positions derived by RCM. The purple lines show the camera's movement between adjacent positions. $o$ is both the initial point of the camera and the position of the character. $\Delta{magnitude_{xy}}$ and $\Delta\theta$ indicate the random changes in distance and angle in the horizontal plane $xOy$. $\Delta{z}$ indicates the random height change along the vertical $z$-axis.
  • ...and 10 more figures
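
The reverse-order calculation described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes right-handed rotation matrices, an initial target vector along the positive $x$-axis, and an initial reference vector along the positive $y$-axis; none of these choices is stated in the caption, and under them the sign of $\beta$ is a convention. How $\vec{v}$ and $\vec{r}$ are built from the shoulder, elbow, and wrist positions is also not given here, so the hypothetical function `euler_from_vectors` takes them directly.

```python
import numpy as np

def rot_x(a):
    """Right-handed rotation about the x-axis by angle a (radians)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(b):
    """Right-handed rotation about the y-axis by angle b (radians)."""
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(g):
    """Right-handed rotation about the z-axis by angle g (radians)."""
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def euler_from_vectors(v, r):
    """Recover (alpha, beta, gamma) from the rotated target vector v and
    reference vector r, assuming v_init = +x and r_init = +y (assumption)."""
    v = np.asarray(v, dtype=float)
    r = np.asarray(r, dtype=float)
    v = v / np.linalg.norm(v)
    # gamma: angle from the positive x-axis to the projection of v onto xOy.
    gamma = np.arctan2(v[1], v[0])
    # beta: angle between v and the xOy plane (sign fixed by the rot_y convention).
    beta = np.arctan2(-v[2], np.hypot(v[0], v[1]))
    # Undo the z and y rotations to obtain r_x, the reference vector after
    # only the x rotation has been applied.
    r_x = rot_y(beta).T @ rot_z(gamma).T @ r
    # alpha: angle from the positive y-axis to the projection of r_x onto yOz.
    alpha = np.arctan2(r_x[2], r_x[1])
    return alpha, beta, gamma
```

A round trip is a simple sanity check: build `R = rot_z(gamma) @ rot_y(beta) @ rot_x(alpha)`, apply it to the assumed initial vectors, and confirm that `euler_from_vectors` returns the original angles (for $|\beta| < \pi/2$).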
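
The random camera movement in the Figure 5 caption can be sketched in a similar spirit. The Python below is one possible reading, not the paper's algorithm: it assumes the camera is tracked in polar coordinates around $o$, that each step adds uniformly sampled changes to the horizontal distance, horizontal angle, and height, and that distance and height are clamped at zero. The function name `rcm_positions`, its parameters, and the default ranges are all hypothetical.

```python
import math
import random

def rcm_positions(n_steps, max_d_mag=0.5, max_d_theta=math.radians(30),
                  max_d_z=0.3, seed=None):
    """Return n_steps camera positions (x, y, z) relative to the point o.

    The camera starts at o (distance 0, height 0), as in the caption.
    The step ranges are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    mag, theta, z = 0.0, 0.0, 0.0
    positions = []
    for _ in range(n_steps):
        # Random change of the horizontal distance from o, kept non-negative.
        mag = max(0.0, mag + rng.uniform(-max_d_mag, max_d_mag))
        # Random change of the horizontal angle in the xOy plane.
        theta = (theta + rng.uniform(-max_d_theta, max_d_theta)) % (2 * math.pi)
        # Random change of height along the z-axis, kept non-negative (assumption).
        z = max(0.0, z + rng.uniform(-max_d_z, max_d_z))
        positions.append((mag * math.cos(theta), mag * math.sin(theta), z))
    return positions
```

Calling `rcm_positions(20, seed=0)` yields a reproducible 20-point trajectory. In a real capture pipeline the camera would presumably also be oriented toward the character at each position, which this sketch leaves out.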