Move-in-2D: 2D-Conditioned Human Motion Generation

Hsin-Ping Huang, Yang Zhou, Jui-Hsien Wang, Difan Liu, Feng Liu, Ming-Hsuan Yang, Zhan Xu

TL;DR

Move-in-2D addresses the challenge of generating realistic human motion in arbitrary scenes by conditioning motion generation on a 2D background image and a text prompt. The authors propose a diffusion-based framework with in-context conditioning via a Multi-Conditional Transformer, trained on a newly collected HiC-Motion dataset of 300k real-world videos annotated with 3D SMPL motions. The approach achieves scene- and text-aligned motion that projects naturally onto 2D scenes and improves downstream video generation quality, outperforming text-only and 3D-scene baselines. This 2D-conditioning paradigm broadens applicability to diverse environments without 3D scene reconstruction, enabling affordance-aware, large-dynamic human motions in video synthesis.

Abstract

Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often use existing motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: 2D-conditioned human motion generation. Given an image representing the target scene and a text prompt describing the desired motion, we generate a motion sequence that aligns with the text description and projects naturally onto the scene image. This generated motion then serves as the control signal for the subsequent video generation tasks.
  • Figure 2: Overview. The text prompt and background scene image are encoded by the CLIP and DINO encoders, and incorporated into the model via in-context conditioning. The AdaLN layer receives the diffusion timestep as input. Our multi-conditional transformer model then generates a human motion sequence through a diffusion denoising process, aligning the generated motion with both input conditions.
  • Figure 3: Affordance-aware human generation. Our model generates human poses consistent with both text prompts and scene context, such as standing on a cliff. It also supports complex human-scene interactions, including activities like petting a dog.
  • Figure 4: Motion generation with large dynamics. Our results show motion sequences that are accurately placed and move within scenes, such as playing tennis, enabling the generation of complex human activities that are challenging for video generation models.
  • Figure 5: Comparison to state-of-the-art. MDM and SceneDiff produce implausible poses, MLD generates motion that does not match the scene, and HUMANISE generates static poses. Our method generates coherent motion aligned with both the scene and text prompts.
  • ...and 1 more figures
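
The conditioning scheme described in the Figure 2 caption can be illustrated with a small, dependency-free sketch. It shows only the two mechanisms the caption names: in-context conditioning (text and scene tokens placed in the same sequence as the motion tokens) and an AdaLN layer modulated by the diffusion timestep. Everything else is an assumption made for illustration and is not taken from the paper: the token counts, the feature width `DIM`, the sinusoidal timestep embedding, the direct use of that embedding as scale and shift (a learned projection would normally produce them), and the random vectors standing in for CLIP, DINO, and noisy motion features.

```python
import math
import random

DIM = 8  # hypothetical feature width, chosen only to keep the example small


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, scale, shift):
    """Adaptive LayerNorm: normalize, then apply a timestep-dependent scale and shift."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


def timestep_embedding(t, dim):
    """Sinusoidal embedding of the diffusion timestep (an assumed, common choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def build_input_sequence(text_tokens, scene_tokens, motion_tokens):
    """In-context conditioning: condition tokens join the motion tokens in one sequence."""
    return text_tokens + scene_tokens + motion_tokens


rng = random.Random(0)


def rand_tokens(n):
    """Random stand-in features; a real model would use encoder outputs here."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


text_tokens = rand_tokens(1)     # stand-in for a CLIP text embedding
scene_tokens = rand_tokens(4)    # stand-in for DINO scene-image features
motion_tokens = rand_tokens(16)  # stand-in for noisy motion frames at step t

sequence = build_input_sequence(text_tokens, scene_tokens, motion_tokens)

# Assumption: scale and shift are taken directly from the timestep embedding.
t_emb = timestep_embedding(t=500, dim=DIM)
scale = [0.1 * v for v in t_emb]
shift = [0.1 * v for v in reversed(t_emb)]

# Every token in the joint sequence receives the same timestep modulation.
modulated = [adaln(tok, scale, shift) for tok in sequence]

# Only the positions that held motion tokens are read out as the motion prediction.
motion_out = modulated[-len(motion_tokens):]
```

In a full model, attention layers would sit between the modulation and the read-out, so the motion positions could attend to the text and scene positions. The sketch omits them to keep the sequence layout and the timestep modulation easy to see.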