Language-free Compositional Action Generation via Decoupling Refinement
Xiao Liu, Guangyi Chen, Yansong Tang, Guangrun Wang, Xiao-Ping Zhang, Ser-Nam Lim
TL;DR
This work targets language-free compositional action generation in 3D, addressing the difficulty of combining simple sub-actions into simultaneous, unseen composites without textual annotations. It introduces a three-part framework: Action Coupling uses energy-based attention masks and a Gaussian mixing scheme to synthesize pseudo-compositional data; Conditional Action Generation employs a CVAE to learn a latent space for diverse action synthesis; Decoupling Refinement leverages SMPL-based 3D-to-2D rendering and MAE inpainting to enforce semantic consistency between sub-actions and their composites. The authors create two benchmarks, HumanAct-C and UESTC-C, and demonstrate through quantitative metrics (FID, Acc, Div, Multimod) and qualitative visuals that their language-free approach surpasses baselines and text-guided methods in generating realistic, disentangled, and diverse compositional actions. This framework reduces reliance on costly language data and enables zero-shot compositional action generation, with practical implications for animation, robotics, and virtual character control. The key notation covers the CVAE objective $\mathcal{L}_{CVAE}$, the conditional distribution $p(\tilde{\mathbf{y}}|\tilde{\mathbf{x}})$, and the mixing relations for $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{y}}$.
Abstract
Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation. Existing methods largely rely on extensive natural language annotations to discern composable latent semantics, a process that is often costly and labor-intensive. In this study, we introduce a novel framework to generate compositional actions without relying on language auxiliaries. Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling uses an energy model to extract the attention masks of each sub-action, then integrates two actions using these attention masks to generate pseudo-training examples. Next, we employ a conditional generative model, a CVAE, to learn a latent space that facilitates diverse generation. Finally, we propose Decoupling Refinement, which leverages a self-supervised pre-trained model, MAE, to ensure semantic consistency between the sub-actions and the compositional actions. This refinement process involves four steps: rendering the generated 3D actions into 2D space, decoupling these images into two sub-segments, using the MAE model to restore the complete image from each sub-segment, and constraining the recovered images to match images rendered from the raw sub-actions. Because no existing dataset contains both sub-actions and compositional actions, we created two new datasets, named HumanAct-C and UESTC-C, and present a corresponding evaluation metric. Both qualitative and quantitative assessments are conducted to demonstrate the efficacy of our approach.
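To make the Action Coupling idea concrete, the following is a minimal sketch of mask-based mixing of two sub-actions into one pseudo-compositional sample. It is not the authors' implementation. It assumes each action is a pose sequence of shape (frames, joints, 3) and that per-joint attention scores are already available (in the paper these come from an energy model, which is not reproduced here). The function names `couple_actions` and `couple_labels`, the hard binary mask, and the multi-hot label are all illustrative assumptions.

```python
import numpy as np


def couple_actions(x_a, x_b, attn_a, attn_b):
    """Mix two pose sequences joint by joint (illustrative sketch only).

    x_a, x_b: arrays of shape (T, J, 3), two sub-action pose sequences.
    attn_a, attn_b: arrays of shape (J,), hypothetical per-joint attention
        scores; how they are obtained is outside the scope of this sketch.
    Returns the mixed sequence of shape (T, J, 3) and the per-joint mask.
    """
    if x_a.shape != x_b.shape:
        raise ValueError("both sequences must share the same shape")
    # Assumption: a joint is taken from action A when A attends to it
    # at least as strongly as B does (hard mask instead of a soft blend).
    mask = (attn_a >= attn_b).astype(x_a.dtype)   # shape (J,)
    m = mask[None, :, None]                       # broadcast to (1, J, 1)
    mixed = m * x_a + (1.0 - m) * x_b
    return mixed, mask


def couple_labels(label_a, label_b, num_classes):
    """Build a multi-hot label marking both sub-action classes (assumption)."""
    y = np.zeros(num_classes, dtype=np.float32)
    y[label_a] = 1.0
    y[label_b] = 1.0
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_a = rng.normal(size=(60, 24, 3))   # e.g. 60 frames, 24 joints
    x_b = rng.normal(size=(60, 24, 3))
    attn_a = rng.random(24)
    attn_b = rng.random(24)
    mixed, mask = couple_actions(x_a, x_b, attn_a, attn_b)
    print(mixed.shape, int(mask.sum()), "joints taken from action A")
```

A hard mask keeps the sketch short. A soft blend with weights in [0, 1] would fit the same `m * x_a + (1 - m) * x_b` form and may be closer in spirit to the Gaussian mixing scheme mentioned in the summary.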
