Language-free Compositional Action Generation via Decoupling Refinement

Xiao Liu, Guangyi Chen, Yansong Tang, Guangrun Wang, Xiao-Ping Zhang, Ser-Nam Lim

TL;DR

This work targets language-free compositional action generation in 3D: combining simple sub-actions into simultaneous, unseen composite actions without textual annotations. It introduces a three-part framework. Action Coupling uses energy-based attention masks and a Gaussian mixing scheme to synthesize pseudo-compositional data. Conditional Action Generation employs a CVAE to learn a latent space for diverse action synthesis. Decoupling Refinement uses SMPL-based 3D-to-2D rendering and MAE inpainting to enforce semantic consistency between sub-actions and their composites. The authors create two benchmarks, HumanAct-C and UESTC-C, and report quantitative metrics (FID, Acc, Div, Multimod) and qualitative visuals indicating that their language-free approach surpasses baselines and text-guided methods in generating realistic, disentangled, and diverse compositional actions. The framework reduces reliance on costly language data and enables zero-shot compositional action generation, with practical implications for animation, robotics, and virtual character control. The key expressions are the CVAE objective $\mathcal{L}_{CVAE}$, the conditional distribution $p(\tilde{\mathbf{y}}|\tilde{\mathbf{x}})$, and the mixing relations for $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{y}}$.

Abstract

Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation. Existing methods largely rely on extensive natural language annotations to discern composable latent semantics, a process that is often costly and labor-intensive. In this study, we introduce a novel framework to generate compositional actions without relying on language auxiliaries. Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling uses an energy model to extract an attention mask for each sub-action, then integrates two actions using these masks to generate pseudo-training examples. Next, we employ a conditional generative model, CVAE, to learn a latent space that facilitates diverse generation. Finally, we propose Decoupling Refinement, which leverages a self-supervised pre-trained model, MAE, to ensure semantic consistency between the sub-actions and the compositional actions. This refinement process involves rendering generated 3D actions into 2D space, decoupling these images into two sub-segments, using the MAE model to restore the complete image from each sub-segment, and constraining the recovered images to match images rendered from the raw sub-actions. Because no existing dataset contains both sub-actions and compositional actions, we created two new datasets, HumanAct-C and UESTC-C, and present a corresponding evaluation metric. Both qualitative and quantitative assessments are conducted to demonstrate the efficacy of our approach.
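
The Action Coupling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `motion_energy` and `couple_actions`, the `(T, J, 3)` joint-position layout, and the `eps` parameter are all assumptions made for illustration, and a simple per-joint energy ratio stands in for the paper's energy model and Gaussian mixing scheme, whose exact form is not given on this page.

```python
import numpy as np


def motion_energy(motion: np.ndarray) -> np.ndarray:
    """Per-joint motion energy (hypothetical definition).

    motion: array of shape (T, J, 3) holding J joint positions over T frames.
    Returns an array of shape (J,): the mean squared frame-to-frame
    displacement of each joint.
    """
    velocity = np.diff(motion, axis=0)            # (T-1, J, 3)
    return (velocity ** 2).sum(axis=-1).mean(axis=0)


def couple_actions(motion_a: np.ndarray,
                   motion_b: np.ndarray,
                   eps: float = 1e-8) -> np.ndarray:
    """Mix two sub-actions into one pseudo-compositional action.

    Each joint is weighted toward the sub-action in which it moves more.
    This energy-ratio weight is an assumed stand-in for the paper's
    attention masks and Gaussian mixing.
    """
    if motion_a.shape != motion_b.shape:
        raise ValueError("both motions must share the same (T, J, 3) shape")
    energy_a = motion_energy(motion_a)
    energy_b = motion_energy(motion_b)
    # Soft attention weight in [0, 1]; equals 0.5 when both joints are static.
    weight_a = (energy_a + eps) / (energy_a + energy_b + 2.0 * eps)
    weight_a = weight_a[None, :, None]            # broadcast over T and xyz
    return weight_a * motion_a + (1.0 - weight_a) * motion_b


# Toy example: joint 0 moves only in action A, joint 1 moves only in action B.
frames = np.arange(4, dtype=float)
action_a = np.zeros((4, 2, 3))
action_a[:, 0, 0] = frames                        # joint 0 slides along x
action_b = np.zeros((4, 2, 3))
action_b[:, 1, 1] = frames                        # joint 1 slides along y
mixed = couple_actions(action_a, action_b)
```

In this toy case the mixed sequence keeps the moving joint from each sub-action, which is the behaviour the attention masks are meant to provide when, for example, "walking" is dominated by the legs and "drinking" by one arm.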

Paper Structure (26 sections, 15 equations, 10 figures, 3 tables)


Figures (10)

  • Figure 1: The comparison between separate action generation and compositional action generation. The compositional method aims to amalgamate two sub-actions, like "walking" and "drinking", into a simultaneous, unseen composite concept - "walking + drinking".
  • Figure 2: The pipeline of our framework. It contains three main components: Action Coupling (in orange), Conditional Action Generation (in green), and Decoupling Refinement (in blue). These processes involve identifying active regions of sub-actions with motion energy, and mixing sub-actions as pseudo-compositional actions for training a conditional generation model. Generated compositional actions are then converted and decoupled into masked images. A pre-trained MAE model is used to recover these images.
  • Figure 3: Compositional 3D motion generations with different categories on the HumanAct-C and UESTC-C datasets.
  • Figure 4: The qualitative comparisons between our methods and other baseline methods.
  • Figure A.1: The visualization of examples of our dataset HumanAct-C and UESTC-C.
  • ...and 5 more figures