BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang

TL;DR

BiPO generates natural, expressive 3D human motions from text by combining part-based motion generation with a bidirectional autoregressive framework. This gives fine-grained control over individual body parts and long-horizon coherence without requiring ground-truth motion length. A Partial Occlusion technique relaxes inter-part dependencies during training, improving robustness and diversity. On HumanML3D, BiPO achieves state-of-the-art FID and semantic alignment relative to ParCo, MoMask, and BAMM, and it also excels in motion-editing tasks that synthesize motion from partially generated sequences and text. The approach combines six part-specific VQ-VAE encoders/decoders and 14-layer transformers with carefully designed masking, which is useful for animation, AR/VR, and game development where flexible, text-driven motion is needed.

Abstract

Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results demonstrate BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
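
As a rough illustration of the Partial Occlusion idea described in the abstract, the sketch below hides the features of the other body parts with some probability when preparing the context for one target part. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `occlude_other_parts`, the placeholder part names, the probability `p_occlude`, and the choice to occlude by zeroing (rather than, say, a learned mask token) are all hypothetical.

```python
import numpy as np

# Placeholder part names: the source only states that there are six parts
# and that one of them is Root. The other names are hypothetical.
PARTS = ["root", "part_1", "part_2", "part_3", "part_4", "part_5"]


def occlude_other_parts(part_feats, target, p_occlude, rng):
    """Return a copy of part_feats in which each non-target part is
    replaced by zeros with probability p_occlude (assumed occlusion rule)."""
    occluded = {}
    for name, feat in part_feats.items():
        if name != target and rng.random() < p_occlude:
            # Hide this part's information from the target part.
            occluded[name] = np.zeros_like(feat)
        else:
            # Keep the target part and any non-occluded parts unchanged.
            occluded[name] = feat.copy()
    return occluded


rng = np.random.default_rng(0)
# Toy features: 4 time steps, 8 channels per part.
feats = {name: np.ones((4, 8)) for name in PARTS}
context = occlude_other_parts(feats, target="root", p_occlude=0.5, rng=rng)
```

In a training loop, such a function would be called once per target part and per batch, so that the model sometimes has to predict a part without seeing all of the others.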

Paper Structure

This paper contains 38 sections, 11 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: BiPO generates diverse, high-quality 3D human motions from text prompts, capturing subtle nuances and motion details.
  • Figure 2: The architecture of our BiPO Transformer, showing how part-based learning is achieved. The example shown is for part i = Root.
  • Figure 3: Dual-iteration Cascaded Part-based Motion Decoding. In the first iteration, each part undergoes autoregressive decoding with a unidirectional causal mask to generate coarse-grained motion and predict sequence length. In the second iteration, a bidirectional causal mask is applied, allowing part-based bidirectional decoding to remove and re-predict the even-indexed motion tokens, resulting in a refined and coordinated motion sequence across all parts.
  • Figure 4: Qualitative comparison with existing methods. Words highlighted in red indicate the overall action, while words highlighted in blue specify how the action is performed.
  • Figure 5: User study results showing BiPO's preference rate over other models, with a dashed red line at the 50% threshold.
  • ...and 6 more figures
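
The two masking steps named in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: the helper names, the sentinel `MASK_ID`, and the use of 0-based indexing for "even-indexed" tokens are assumptions, and the actual bidirectional mask used by BiPO may differ from what is shown here.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for a removed token


def causal_attention_mask(n):
    """First iteration: position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


def mask_even_positions(tokens, mask_id=MASK_ID):
    """Second iteration: remove tokens at even (0-based) positions so they
    can be re-predicted with context from both sides."""
    masked = np.array(tokens, copy=True)
    masked[::2] = mask_id
    return masked


coarse_tokens = [5, 6, 7, 8, 9]          # toy output of the first pass
to_refine = mask_even_positions(coarse_tokens)
first_pass_mask = causal_attention_mask(len(coarse_tokens))
```

The odd-positioned tokens from the first pass stay in place and act as anchors on both sides of each removed token during the second pass.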