Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input

Jian Wang, Rishabh Dabral, Diogo Luvizon, Zhe Cao, Lingjie Liu, Thabo Beeler, Christian Theobalt

TL;DR

Ego4o addresses egocentric motion capture and understanding from flexible, multi-modal inputs provided by consumer wearables. It fuses a part-aware VQ-VAE for discrete motion representation with a multi-modal transformer encoder and an Ego4o-LLM for motion understanding, incorporating test-time optimization to refine predictions. The approach demonstrates improved MoCap accuracy and motion description quality across DIP-IMU and Nymeria datasets, and shows that generated textual descriptions can further enhance motion capture when ground-truth text is unavailable. By enabling robust operation with varying IMU placements and optional imagery/text, Ego4o advances practical, sensor-rich motion analysis for everyday applications leveraging ubiquitous wearables.

Abstract

This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. These devices provide diverse, multi-modal sensor inputs, including egocentric images and 1-3 sparse IMU sensors in varied combinations. Motion descriptions can also accompany these signals. The diverse input modalities and their intermittent availability pose challenges for consistent motion capture and understanding. In this work, we present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs. This method maintains performance with partial inputs while achieving better results when multiple modalities are combined. First, the IMU sensor inputs, the optional egocentric image, and the optional text description of human motion are encoded into the latent space of a motion VQ-VAE. Next, the latent vectors are sent to the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latent vectors can be input into a multi-modal LLM to generate human motion descriptions, which can further enhance motion capture accuracy. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in predicting accurate human motion and high-quality motion descriptions.
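
The quantization step behind a VQ-VAE latent space can be illustrated with a small sketch. It is not the authors' implementation: the part names (`upper_body`, `lower_body`), the two-dimensional latents, and the toy codebooks are all assumptions chosen for readability. It shows only the generic idea that each continuous latent vector is replaced by the index of its nearest codebook entry, with one codebook per body part.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry nearest to `latent`."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(latent, codebook[i]),
    )


def quantize_parts(
    latents: Dict[str, Vector], codebooks: Dict[str, List[Vector]]
) -> Dict[str, int]:
    """Quantize each body part's latent against that part's own codebook."""
    return {part: quantize(vec, codebooks[part]) for part, vec in latents.items()}


# Hypothetical toy codebooks; a real model would learn these during training.
codebooks = {
    "upper_body": [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    "lower_body": [[0.0, 1.0], [1.0, 0.0]],
}
# Hypothetical encoder outputs for one time step.
latents = {"upper_body": [0.9, 1.2], "lower_body": [0.1, 0.9]}

codes = quantize_parts(latents, codebooks)
print(codes)
```

Keeping a separate codebook per part means a code index only has meaning within its own part, which is one plausible reading of "part-aware" in the TL;DR above.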

Paper Structure

This paper contains 35 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Our method can use an egocentric image and 1-3 IMU sensors from wearable devices to accurately predict human motion and generate motion descriptions. Motion descriptions, when available, can also enhance motion capture accuracy. Ego4o supports flexible input combinations, functioning with or without images, or with varied IMU placements.
  • Figure 2: Overview of our Ego4o framework. We first train a VQ-VAE (purple blocks) to learn the part-aware motion codebook (\ref{method:vqvae}). For motion capture (green blocks), the system processes IMU sensor data, egocentric images, and motion descriptions through a multi-modal encoder to generate motion codes in the codebook. These codes are then decoded to predict human motion (\ref{method:mocap}). For motion understanding (blue blocks), the system combines motion codes and egocentric images in a finetuned LLM to generate motion descriptions (\ref{method:motion_understanding}), which can be fed back to enhance motion capture accuracy.
  • Figure 3: Egocentric Human Motion Understanding. Each modality is encoded separately and then concatenated in the order specified by the input instruction $X_\text{ins}$ before being fed into the LLM.
  • Figure 4: Comparison of human motion capture results between Ego4o, Ego4o-IMU, and IMUPoser [mollyn2023imuposer] on the DIP-IMU [huang2018deep] (left) and Nymeria [ma2024nymeria] (right) datasets. The red skeleton is the ground truth, while the green skeleton is the predicted pose. Our predictions are more accurate than the baselines when only using IMU input, and using egocentric images and motion descriptions further improves performance.
  • Figure 5: Quantitative results of human motion capture on the Nymeria dataset. The results compare our method with IMUPoser under different IMU setups. H, LP, RP, LW, and RW indicate the body part on which each IMU is located. H: head, LP: left hip, RP: right hip, LW: left wrist, RW: right wrist.
  • ...and 2 more figures
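
The ordering rule in the Figure 3 caption (modalities encoded separately, then concatenated in the order given by the instruction $X_\text{ins}$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `<motion>` and `<image>` placeholder syntax, the whitespace tokenizer, and the string "tokens" standing in for embeddings are all hypothetical.

```python
import re
from typing import Dict, List


def assemble_sequence(instruction: str, encoded: Dict[str, List[str]]) -> List[str]:
    """Build the LLM input by walking the instruction left to right.

    Plain text is split on whitespace (a stand-in for a real tokenizer);
    each <name> placeholder is replaced by that modality's encoded tokens.
    """
    sequence: List[str] = []
    # The capturing group keeps the placeholders in the split result.
    for piece in re.split(r"(<\w+>)", instruction):
        if not piece.strip():
            continue
        match = re.fullmatch(r"<(\w+)>", piece)
        if match and match.group(1) in encoded:
            sequence.extend(encoded[match.group(1)])
        else:
            sequence.extend(piece.split())
    return sequence


# Hypothetical per-modality encoder outputs, shown as labelled strings.
encoded = {"motion": ["m0", "m1"], "image": ["i0"]}

motion_first = assemble_sequence("Describe <motion> given <image>", encoded)
image_first = assemble_sequence("Given <image> describe <motion>", encoded)
print(motion_first)
print(image_first)
```

Because the instruction alone determines the order, the same encoders can serve prompts that omit a modality: a placeholder that never appears contributes no tokens.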