EgoLM: Multi-Modal Language Model of Egocentric Motions

Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, Lingni Ma

TL;DR

EgoLM is a versatile framework for egocentric motion understanding from multi-modal data. It unifies a range of motion understanding tasks, including motion narration from video or motion data and motion generation from text or sparse sensor data.

Abstract

With the growing prevalence of wearable devices, learning egocentric motions becomes essential for developing contextual AI. In this work, we present EgoLM, a versatile framework that tracks and understands egocentric motions from multi-modal inputs, e.g., egocentric videos and motion sensors. EgoLM exploits rich contexts to disambiguate egomotion tracking and understanding, both of which are ill-posed under single-modality conditions. To support this versatile, multi-modal framework, our key insight is to model the joint distribution of egocentric motions and natural language using large language models (LLMs). Multi-modal sensor inputs are encoded and projected into the joint latent space of the language model, where they prompt motion generation for egomotion tracking or text generation for egomotion understanding. Extensive experiments on a large-scale multi-modal human motion dataset validate the effectiveness of EgoLM as a generalist model for universal egocentric learning.

Paper Structure (23 sections, 3 equations, 18 figures, 3 tables)

Figures (18)

  • Figure 1: We propose EgoLM, a multi-modal language model that unifies egocentric motion tracking and understanding from wearable sensor data, e.g., sparse motion sensors and egocentric videos.
  • Figure 2: Overview of EgoLM. Three steps are designed for the training of EgoLM, i.e., motion tokenizer training, motion pre-training and multi-modal instruction tuning.
  • Figure 3: Details of Multi-Modal Instruction Tuning. Different modalities are encoded separately. Their features are concatenated in the order of the instruction template and input into the transformer layers of the language model.
  • Figure 4: Qualitative Results of Three-Points Motion Tracking. Skeletons are color-coded by joint position errors. Baseline methods use only three-points as inputs. Ours uses both three-points and egocentric videos as inputs.
  • Figure 5: Qualitative Results of One-Point Motion Tracking. Skeletons are color-coded by joint position errors. EgoEgo uses only one-point as input. Ours uses both one-point and egocentric videos as inputs.
  • ...and 13 more figures
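
The Figure 3 caption describes encoding each modality separately and concatenating the features in the order of the instruction template before they enter the language model. The sketch below illustrates only that ordering step, in plain Python. Everything in it is an assumption made for illustration: the names `make_projector`, `fake_features` and `build_prompt`, the feature widths, the token counts, and the random linear maps that stand in for learned encoders and projectors. It is not the authors' implementation.

```python
import random

HIDDEN = 4  # hypothetical width of the language model's embedding space


def make_projector(in_dim, out_dim=HIDDEN, seed=0):
    """Return a random linear map standing in for a learned modality projector."""
    rnd = random.Random(seed)
    weight = [[rnd.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]

    def project(feats):
        # Plain matrix product: (n_tokens, in_dim) x (in_dim, out_dim).
        return [
            [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in feats
        ]

    return project


def fake_features(n_tokens, dim, seed):
    """Generate placeholder encoder outputs; a real system would use trained encoders."""
    rnd = random.Random(seed)
    return [[rnd.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


def build_prompt(template, features, projectors):
    """Project each modality, then concatenate along the token axis in template order."""
    prompt = []
    for name in template:
        prompt.extend(projectors[name](features[name]))
    return prompt


# Hypothetical token counts and feature widths for each modality.
features = {
    "text": fake_features(5, 6, seed=1),
    "sensor": fake_features(8, 9, seed=2),
    "video": fake_features(8, 12, seed=3),
}
projectors = {
    name: make_projector(len(feats[0]), seed=i)
    for i, (name, feats) in enumerate(features.items())
}

template = ["text", "sensor", "video"]  # order taken from the instruction template
prompt = build_prompt(template, features, projectors)
print(len(prompt), len(prompt[0]))
```

Reordering `template` changes only the position of each modality's tokens in the sequence, not their content, which is why a single model can be prompted with different input combinations for tracking or narration.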