TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation

Yabiao Wang, Shuo Wang, Jiangning Zhang, Ke Fan, Jiafu Wu, Zhucun Xue, Yong Liu

TL;DR

TIMotion addresses two-person motion generation by embedding temporal dynamics and inter-person interactions within a general framework called MetaMotion. It introduces three core components (Causal Interactive Injection, Role-Evolving Scanning, and Localized Pattern Amplification) to capture causal interactions, dynamic role switches, and short-term motion patterns, respectively, and it is compatible with Transformer, Mamba, and RWKV backbones. Experiments on InterHuman and InterX show that TIMotion achieves state-of-the-art or competitive results with improved efficiency and editability, supported by both quantitative metrics and qualitative assessments. The approach offers a scalable, flexible paradigm for realistic human-human motion generation and editing, with potential applications in animation, robotics, and human-robot collaboration.

Abstract

Human-human motion generation is essential for understanding humans as social beings. Current methods fall into two main categories: single-person-based methods and separate modeling-based methods. To delve into this field, we abstract the overall generation process into a general framework, MetaMotion, which consists of two phases: temporal modeling and interaction mixing. For temporal modeling, the single-person-based methods directly concatenate two people into a single one, while the separate modeling-based methods skip the modeling of interaction sequences. This inadequate modeling results in sub-optimal performance and redundant model parameters. In this paper, we introduce TIMotion (Temporal and Interactive Modeling), an efficient and effective framework for human-human motion generation. Specifically, we first propose Causal Interactive Injection to model two separate sequences as a single causal sequence, leveraging their temporal and causal properties. Then we present Role-Evolving Scanning to adapt to changes in the active and passive roles throughout the interaction. Finally, to generate smoother and more rational motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman and InterX demonstrate that our method achieves superior performance. Project page: https://aigc-explorer.github.io/TIMotion-page/
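
To make the idea of treating two separate sequences as one causal sequence more concrete, here is a minimal sketch. It assumes the merge is a frame-by-frame interleaving (person A's frame, then person B's frame, for each time step); the abstract does not specify the ordering, so that choice is an assumption. The names `interleave_causal` and `split_causal` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one person's pose features at one time step


def interleave_causal(motion_a: Sequence[Frame],
                      motion_b: Sequence[Frame]) -> List[Frame]:
    """Merge two per-person sequences into one sequence of length 2T.

    Assumed ordering (not stated in the abstract): a_1, b_1, a_2, b_2, ...
    With this ordering, every token is preceded by all earlier frames of
    both people, so a causal model can condition on the joint history.
    """
    if len(motion_a) != len(motion_b):
        raise ValueError("both sequences must have the same number of frames")
    merged: List[Frame] = []
    for frame_a, frame_b in zip(motion_a, motion_b):
        merged.append(frame_a)
        merged.append(frame_b)
    return merged


def split_causal(merged: Sequence[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Undo interleave_causal: even positions are person A, odd are person B."""
    return list(merged[0::2]), list(merged[1::2])


# Toy example: three frames per person, one feature per frame.
person_a = [[0.0], [1.0], [2.0]]
person_b = [[10.0], [11.0], [12.0]]
merged = interleave_causal(person_a, person_b)
recovered_a, recovered_b = split_causal(merged)
```

Calling `interleave_causal(person_b, person_a)` yields the ordering in which the other person comes first at each time step. Whether the paper's role-handling mechanism works this way is not stated in the abstract.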

Paper Structure (32 sections, 26 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: MetaMotion and performance of MetaMotion-based models on the InterHuman validation set. We abstract the MetaMotion concept, which illustrates the intrinsic properties of human-human motion generation in the interaction process. (a) and (b) show the two existing types of methods, and (c) shows our method, TIMotion; LPA refers to Localized Pattern Amplification. (d) compares the performance of the different methods on the InterHuman dataset.
  • Figure 2: The overall framework of TIMotion. We contribute three primary technical designs. First, we propose Causal Interactive Injection to utilize the temporal properties of motion sequences. Then we present Role-Evolving Scanning to adapt to the ever-evolving roles during interaction. Finally, we design Localized Pattern Amplification to capture short-term motion patterns.
  • Figure 3: Illustration of changing active and passive roles. The first person acts as the active role in the early stages, and as time progresses, the other person becomes the active role of the motion.
  • Figure 4: Qualitative comparison with InterGen on human-human motion generation. Darker colors indicate later frames. The sequences generated by TIMotion are more consistent with the text description.
  • Figure 5: Qualitative results on the motion in-betweening task. The first and last frames are fixed. Darker colors indicate later frames. Our method achieves smooth and natural transitions between the conditioned motions.
  • ...and 1 more figure