HUMOTO: A 4D Dataset of Mocap Human Object Interactions

Jiaxin Lu, Chun-Hao Paul Huang, Uttaran Bhattacharya, Qixing Huang, Yi Zhou

TL;DR

HUMOTO presents a high-fidelity 4D dataset of human-object interactions capturing rich hand and body motion across 63 objects and 72 parts, acquired with a multi-sensor mocap pipeline and artist-led refinement. A Scene-Driven LLM Scripting approach seeds cohesive, task-oriented interactions, while rigorous data cleaning and multi-level textual annotations enhance usability for motion generation, robotics, and vision. Quantitative and perceptual evaluations show HUMOTO achieves superior hand pose fidelity, low foot sliding, reduced penetration, and favorable interaction quality compared with prior HOI datasets. The work demonstrates HUMOTO’s potential to advance realistic HOI modeling, but notes limitations such as a single performer and labor-intensive preparation, signaling directions for broader coverage and automation in future work.

Abstract

We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 735 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmark comparisons against other datasets. HUMOTO's comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: https://jiaxin-lu.github.io/humoto/
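
To give a sense of scale, the figures quoted in the abstract can be turned into frame counts and average sequence lengths. The sketch below is only back-of-the-envelope arithmetic: it assumes that 7,875 seconds is the total duration across all 735 sequences and that every sequence is sampled at 30 fps. The function name `dataset_scale` is hypothetical and not part of any HUMOTO tooling.

```python
# Figures stated in the abstract.
NUM_SEQUENCES = 735
TOTAL_SECONDS = 7875
FPS = 30


def dataset_scale(num_sequences: int, total_seconds: float, fps: int) -> dict:
    """Derive aggregate size statistics from the headline numbers.

    Assumes total_seconds is the summed duration of all sequences and
    that fps is constant across the dataset (an assumption, not a fact
    confirmed beyond the abstract).
    """
    total_frames = total_seconds * fps
    return {
        "total_frames": total_frames,
        "total_minutes": total_seconds / 60,
        "mean_seconds_per_sequence": total_seconds / num_sequences,
        "mean_frames_per_sequence": total_frames / num_sequences,
    }


stats = dataset_scale(NUM_SEQUENCES, TOTAL_SECONDS, FPS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the dataset amounts to roughly 236 thousand frames, with an average sequence a little under 11 seconds long.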

Paper Structure

This paper contains 25 sections, 8 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Overview of the HUMOTO dataset. The dataset contains mocap 4D human-object interaction animations with multiple objects. Its distinguishing features are detailed, accurate interaction modeling, in particular the detailed hand poses. The objects are precisely modeled by artists. We additionally provide text annotations of the interactions at different levels of abstraction.
  • Figure 2: Scene-Driven LLM Scripting. We established target scenes, prepared relevant interaction objects, and then leveraged LLMs to generate detailed action scripts.
  • Figure 3: Capture environment. Left: Overview of our capture environment showing two Kinect cameras, stage, lighting, calibration board, and interaction objects. Right: Calibration procedure with the performer in a standardized position, enabling precise alignment between mocap suit data and camera coordinates.
  • Figure 4: 3D Meshes. Artist-modeled objects used in HUMOTO.
  • Figure 5: HUMOTO dataset visualization. We depict human-object interactions with text descriptions (left), detailed hand poses and contact maps highlighting interaction areas (middle), and trajectories of human body parts and objects during activities (right). These complementary representations provide comprehensive data for various applications.
  • ...and 10 more figures