IKMo: Image-Keyframed Motion Generation with Trajectory-Pose Conditioned Motion Diffusion Model

Yang Zhao, Yan Zhang, Xubo Yang

TL;DR

IKMo introduces image-keyframed motion generation that decouples trajectory and keyframe-pose conditioning in a two-stage diffusion framework: the control inputs are first handled by a Motion Optimization module and then encoded in parallel. A multi-agent system driven by a multimodal large language model (MLLM) translates user-provided images and text into a structured motion specification (motion description, keyframe poses, trajectory), which is then enforced by a Motion Diffusion Model with a Trajectory Encoder, a Pose Encoder, and ControlNet fusion. Empirical results on HumanML3D and KIT-ML show state-of-the-art performance under trajectory+keyframe constraints, with ablations confirming the importance of Motion Optimization and Motion ControlNet, and a user study indicating improved alignment with user intent. The approach improves controllability and fidelity in diffusion-driven motion synthesis, offering a practical pathway for image-based, user-guided animation.

Abstract

Existing human motion generation methods with trajectory and pose inputs process both modalities globally, leading to suboptimal outputs. In this paper, we propose IKMo, an image-keyframed motion generation method based on a diffusion model in which trajectory and pose are decoupled. The trajectory and pose inputs pass through a two-stage conditioning framework. In the first stage, a dedicated optimization module is applied to refine the inputs. In the second stage, trajectory and pose are encoded in parallel by a Trajectory Encoder and a Pose Encoder. A Motion ControlNet then processes the fused trajectory and pose data and guides the generation of motion with high spatial and semantic fidelity. Experimental results on the HumanML3D and KIT-ML datasets demonstrate that the proposed method outperforms the state of the art on all metrics under trajectory-keyframe constraints. In addition, MLLM-based agents are implemented to pre-process model inputs. Given text and keyframe images from users, the agents extract motion descriptions, keyframe poses, and trajectories as optimized inputs to the motion generation model. We conduct a user study with 10 participants. The results indicate that MLLM-based pre-processing brings the generated motion more in line with users' expectations. We believe that the proposed method improves both the fidelity and controllability of diffusion-based motion generation.
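
To make the agents' output concrete, the following is a minimal sketch of a container for the three items the abstract names (motion description, keyframe poses, trajectory). The class name `MotionConfig`, its field names, the 3D-joint pose layout, and the ground-plane trajectory points are all assumptions made for illustration; the paper's actual configuration format is not given here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]   # assumed: one 3D joint position (x, y, z)
Pose = List[Joint]                   # assumed: a pose is a list of joints
GroundPoint = Tuple[float, float]    # assumed: a trajectory point on the ground plane


@dataclass
class MotionConfig:
    """Hypothetical container for the three outputs named in the abstract."""
    description: str                     # motion description (text prompt)
    keyframe_poses: Dict[int, Pose]      # frame index -> keyframe pose
    trajectory: Dict[int, GroundPoint]   # frame index -> ground-plane point
    num_frames: int                      # length of the motion to generate

    def validate(self) -> None:
        """Raise ValueError if the configuration is internally inconsistent."""
        if not self.description.strip():
            raise ValueError("motion description must not be empty")
        for name, frames in (("keyframe", self.keyframe_poses),
                             ("trajectory", self.trajectory)):
            for t in frames:
                if not 0 <= t < self.num_frames:
                    raise ValueError(
                        f"{name} frame {t} outside [0, {self.num_frames})")
        joint_counts = {len(pose) for pose in self.keyframe_poses.values()}
        if len(joint_counts) > 1:
            raise ValueError("keyframe poses must share one joint count")


# Illustrative values only; they are not taken from the paper.
config = MotionConfig(
    description="a person walks forward and raises both arms",
    keyframe_poses={0: [(0.0, 0.9, 0.0), (0.0, 1.5, 0.0)]},
    trajectory={0: (0.0, 0.0), 59: (0.0, 2.0)},
    num_frames=60,
)
config.validate()
```

Keying poses and trajectory points by frame index keeps the two control signals separate, which mirrors the decoupled treatment of trajectory and pose described in the abstract.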

Paper Structure

This paper contains 33 sections, 11 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) The overall pipeline of IKMo. Given an input image and textual requirement, our MLLM-based multi-agent system outputs a motion configuration consisting of a motion description, keyframe poses, and trajectory coordinates. This configuration is then fed into our Conditioned Motion Diffusion Model to generate the final human motion. (b) Details of the Conditioned Motion Diffusion Model. The model predicts a clean motion from a noised motion sequence and a text prompt, while being guided by keyframe poses and trajectory constraints. (c) Motion Optimization. Keyframe poses and trajectory constraints iteratively perturb the noised motion through gradient descent to better align with control signals. (d) Motion Control. Keyframe poses and trajectory inputs are encoded separately using a Pose Encoder and a Trajectory Encoder. The resulting features are fused and injected into the Motion ControlNet to guide motion generation.
  • Figure 2: Qualitative results. All input images are generated by Doubao. Colored entity frames represent keyframes, while gray frames represent the other frames. The transparency of the gray frames indicates their position in the motion sequence, with more transparent frames appearing earlier. The green trajectory on the ground represents a standard trajectory. To provide a clearer and more consistent view for comparison, we applied translation and rotation to some results. The "Origin" thumbnail shows the original version of the motion.
  • Figure 3: Qualitative results using video/text inputs. Both methods are given the same textual prompt: "The person is performing a dance routine involving a sequence of movements. These include gestures with the arms raised, swinging from side to side, and leg kicks."
  • Figure 4: Ablation results. All input images are generated by Doubao.
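
The Motion Optimization step in Figure 1(c) perturbs the noised motion by gradient descent so that it better matches the control signals. The sketch below illustrates that idea under strong simplifying assumptions: each frame is a flat list of floats, the control loss is a plain squared error on the constrained frames, and the gradient is written out analytically. The function names, step size, and iteration count are hypothetical, and the paper's actual loss and motion representation may differ.

```python
from typing import Dict, List

Frame = List[float]  # assumed: one frame flattened to a feature vector


def control_loss(motion: List[Frame], targets: Dict[int, Frame]) -> float:
    """Squared error between constrained frames and their target values."""
    return sum(
        (motion[t][d] - value) ** 2
        for t, target in targets.items()
        for d, value in enumerate(target)
    )


def optimize_motion(
    motion: List[Frame],
    targets: Dict[int, Frame],
    step_size: float = 0.1,   # assumed value, not from the paper
    num_steps: int = 50,      # assumed value, not from the paper
) -> List[Frame]:
    """Return a copy of `motion` nudged toward `targets` by gradient descent."""
    refined = [list(frame) for frame in motion]  # leave the input untouched
    for _ in range(num_steps):
        for t, target in targets.items():
            for d, value in enumerate(target):
                # d/dx (x - value)^2 = 2 (x - value)
                grad = 2.0 * (refined[t][d] - value)
                refined[t][d] -= step_size * grad
    return refined


# Toy example: three frames with two features each; frames 0 and 2 are constrained.
noised = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
targets = {0: [0.0, 0.0], 2: [1.0, 1.0]}
refined = optimize_motion(noised, targets)
print(control_loss(noised, targets), control_loss(refined, targets))
```

In this simplified loss, only the constrained frames receive a gradient, so unconstrained frames stay as they were. A loss defined on global joint positions derived from a relative motion representation would generally propagate gradients to other frames as well.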