An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video

Xingyu Song, Zhan Li, Shi Chen, Xin-Qiang Cai, Kazuyuki Demachi

TL;DR

The paper tackles action recognition from discontinuous video frames, where standard CNNs lose temporal context. It introduces the 4A (Action Animation-based Augmentation Approach) pipeline, which converts real, discontinuous RGB clips into smooth, multi-view animations in four stages: 2D skeleton extraction, 3D orientation lifting with a Quaternion-based Graph Convolution Network (Q-GCN), Dynamic Skeletal Interpolation (DSI) for motion smoothing, and animation generation in a game-engine environment. Key contributions are the Q-GCN for 2D-to-3D orientation lifting with quaternions, the DSI module that preserves motion semantics during interpolation, and experiments showing comparable or superior performance with only 10% of the real data, plus improved results on in-the-wild videos. According to the authors, the approach bridges the domain gap between synthetic and real data, enabling scalable augmentation for action recognition when video is discontinuous.

Abstract

Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications. Despite significant improvements brought by Convolutional Neural Networks (CNNs), these models suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings. This decline primarily results from the loss of temporal continuity, which is crucial for understanding the semantics of human actions. To overcome this issue, we introduce the 4A (Action Animation-based Augmentation Approach) pipeline, which employs a series of sophisticated techniques: starting with 2D human pose estimation from RGB videos, followed by a Quaternion-based Graph Convolution Network for joint orientation and trajectory prediction, and Dynamic Skeletal Interpolation for creating smoother, diversified actions using game engine technology. This innovative approach generates realistic animations in varied game environments, viewed from multiple viewpoints. In this way, our method effectively bridges the domain gap between virtual and real-world data. In experimental evaluations, the 4A pipeline achieves comparable or even superior performance to traditional training approaches using real-world data, while requiring only 10% of the original data volume. Additionally, our approach demonstrates enhanced performance on in-the-wild videos, marking a significant advancement in the field of action recognition.

Paper Structure (36 sections, 18 equations, 12 figures, 8 tables, 1 algorithm)

Figures (12)

  • Figure 1: Overview of the 4A pipeline. The pipeline begins with a 2D human pose estimation method that extracts the 2D coordinates of the human skeleton from real-world RGB videos. A Quaternion-based Graph Convolution Network (Q-GCN) then predicts the orientation of each bone joint and the trajectory of the body in 3D space. Next, the Dynamic Skeletal Interpolation (DSI) algorithm produces a smoother and more diversified action animation. Game engine technology then turns the skeleton representation sequence into an animation. Finally, the animation is rendered in game environments with diverse scenes and character appearances and captured from multiple viewpoints.
  • Figure 2: Overall architecture of Q-GCN. It starts with three vertex and edge convolutional blocks, each with a residual connection. After the neighbor features are extracted, both the vertex and edge layers are followed by a Squeeze-and-Excitation (SE) block. The vertex graph and edge graph are then concatenated and passed through two fully connected layers, with batch normalization and a ReLU function in between.
  • Figure 3: Qualitative results of major-part representation derived from the NTU-RGB+D dataset, comparing SURREACT with 4A. Our pipeline outperforms SURREACT in the fidelity and realism of motion representation and in the depiction of character details. It also integrates characters into their environments more convincingly, with better lighting and scene coverage.
  • Figure 4: Qualitative results of whole-body representation by 4A from multiple viewpoints, compared with the original RGB frames in H36M.
  • Figure 5: Comparative analysis of different interpolation methods. The black, green, purple, and red plots show the Absolute Angular Distance (AAD) of the original sequence (Original) and of the sequences interpolated using Polynomial Interpolation (PI), Point-wise Polynomial Interpolation (PW-PI), and Dynamic Skeletal Interpolation (DSI), respectively. The five gray dotted boxes mark five randomly selected quaternion sequence segments. PI yields a smooth sequence but cannot segment it dynamically. PW-PI produces a segmented sequence but reduces movement amplitude during interpolation, as the overly smoothed curve shows. DSI both segments the sequence accurately and preserves the semantic integrity of the motion, maintaining fluidity and semantic richness in the interpolated sequence.
  • ...and 7 more figures
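
The Figure 5 caption compares interpolation methods by the angular distance between joint orientations stored as quaternions. The minimal sketch below illustrates that idea only. It is not the paper's DSI, PI, or PW-PI. As a stand-in it uses standard spherical linear interpolation (SLERP) to fill a gap between two key orientations, and it assumes the angular distance is the usual rotation angle between two unit quaternions, which may differ from the paper's exact AAD definition. All function names (`normalize`, `angular_distance`, `slerp`, `fill_gap`) are hypothetical.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def angular_distance(q0, q1):
    """Rotation angle in radians between two unit quaternions.

    Assumed metric, not necessarily the paper's AAD.
    abs() treats q and -q as the same rotation.
    """
    dot = abs(sum(a * b for a, b in zip(q0, q1)))
    return 2.0 * math.acos(min(1.0, dot))


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions, t in [0, 1]."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one endpoint so we follow the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:  # nearly identical: fall back to normalized lerp
        return normalize(tuple(a + t * (b - a) for a, b in zip(q0, q1)))
    theta = math.acos(dot)
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))


def fill_gap(q0, q1, n_missing):
    """Return n_missing evenly spaced in-between orientations."""
    return [slerp(q0, q1, k / (n_missing + 1)) for k in range(1, n_missing + 1)]


# Two key orientations of one joint: identity and a 90-degree turn about z.
q_start = (1.0, 0.0, 0.0, 0.0)
q_end = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

# Fill three missing frames, then measure the per-frame angular steps.
frames = [q_start] + fill_gap(q_start, q_end, 3) + [q_end]
steps = [angular_distance(a, b) for a, b in zip(frames, frames[1:])]
```

With three filled frames, each consecutive step covers an equal quarter of the 90-degree rotation. That uniformity is what makes plain SLERP smooth. A constant-rate fill of this kind carries no notion of motion segments, which is the sort of limitation the caption attributes to the simpler baselines.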