MoFM: A Large-Scale Human Motion Foundation Model

Mohammadreza Baharani, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Gabriel Maldonado, Hamed Tabkhi

TL;DR

MoFM presents a large-scale Motion Foundation Model that learns semantic representations of human motion by discretizing spatio-temporal heatmaps into a MotionBook dictionary via a custom discrete Variational Encoder-Decoder (dVED). The framework performs pose-aware masked self-supervision on a ViT-style backbone, enabling task-agnostic pretraining that transfers effectively to action recognition and anomaly detection, including one-shot settings. Key contributions include the dVED for motion discretization, the MotionBook vocabulary (size $T$), and a self-supervised training paradigm that yields a versatile backbone capable of handling DT1–DT4 with simple task heads. By providing a reusable, discrete-token representation for complex spatio-temporal motion, this approach offers scalable, generalizable motion understanding with practical impact on surveillance, healthcare, and human-robot interaction.

Abstract

Foundation Models (FM) have increasingly drawn the attention of researchers due to their scalability and generalization across diverse tasks. Inspired by the success of FMs and the principles that have driven advancements in Large Language Models (LLMs), we introduce MoFM as a novel Motion Foundation Model. MoFM is designed for the semantic understanding of complex human motions in both time and space. To facilitate large-scale training, MotionBook, a comprehensive human motion dictionary of discretized motions, is designed and employed. MotionBook utilizes Thermal Cubes to capture spatio-temporal motion heatmaps, applying principles from discrete variational models to encode human movements into discrete units for a more efficient and scalable representation. MoFM, trained on a large corpus of motion data, provides a foundational backbone adaptable to diverse downstream tasks, supporting paradigms such as one-shot, unsupervised, and supervised tasks. This versatility makes MoFM well-suited for a wide range of motion-based applications.

Paper Structure

This paper contains 20 sections, 6 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Visualization of videos with corresponding pose tokens normalized by vocabulary size. Each row shows skeletal motion frames alongside tokens, with values mapped from blue (low) to red (high), illustrating the alignment of skeletal actions with motion vocabulary.
  • Figure 2: Architecture of the proposed custom dVED encoder and decoder. The variable r represents the number of layers in which the 2D/3D residual block can be repeated.
  • Figure 3: Comparison of heatmap skeletons: (a) Ground truth heatmap skeleton used as input for dVED; (b) Reconstructed heatmap skeleton generated by dVED. A ghosting effect is observed for moving joints in the dVED output.
  • Figure 4: Overview of Motion Foundation Model (MoFM). Poses are converted into heatmap representations using a Gaussian function (Eq. \ref{eq:guss_factor}), producing a series of thermal cubes. Before pre-training, we train the custom dVED model for reconstruction. This involves tokenizing a series of heatmap cubes in both spatial and temporal dimensions according to a learned vocabulary. After cubing, tokens are masked keypoint-wise with a special mask embedding [M]. The resulting $\{C_i^{m}\}_{i=0}^{K-1}$ masked cubes are then fed into a vision transformer encoder. The backbone predicts the visual tokens of the discretized image based on $\{z_i\}_{i=0}^{K-1}$ generated by dVED.
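
The pose-to-heatmap step mentioned in the Figure 4 caption can be illustrated with a minimal sketch. The paper's exact Gaussian formula is not reproduced on this page, so everything here is an assumption made for illustration: an isotropic Gaussian with a fixed `sigma`, joints combined by taking the per-pixel maximum, and the hypothetical function names `gaussian_heatmap` and `thermal_cube`. It shows only how per-frame heatmaps could be stacked over time into a cube, not the authors' implementation.

```python
import math

def gaussian_heatmap(keypoints, height, width, sigma=2.0):
    """Render one frame: each pixel takes the max Gaussian response over all joints.

    keypoints is a list of (x, y) pixel coordinates. The max-combination and
    the fixed sigma are assumptions, not taken from the paper.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in keypoints:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap

def thermal_cube(pose_sequence, height, width, sigma=2.0):
    """Stack per-frame heatmaps along time; the result is indexed [frame][y][x]."""
    return [gaussian_heatmap(frame, height, width, sigma) for frame in pose_sequence]

# Two frames of a hypothetical 2-joint skeleton moving one pixel to the right.
poses = [[(2, 2), (5, 4)], [(3, 2), (6, 4)]]
cube = thermal_cube(poses, height=8, width=8)
```

Each joint produces a peak of 1.0 at its own pixel, and the peak shifts between frames as the joint moves, which is the spatio-temporal signal a tokenizer such as dVED would then discretize.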