
MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai

TL;DR

MoBind tackles fine-grained IMU–video pose alignment by learning a joint representation that aligns IMU streams with skeletal motion derived from video. It introduces a hierarchical contrastive framework that first aligns token-level temporal segments, then fuses local body-part alignments into a global motion embedding, and augments this with a Masked Token Prediction task to preserve action semantics. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind demonstrates state-of-the-art performance in cross-modal retrieval, sub-second temporal synchronization, and subject/body-part localization, while remaining robust to sensor dropouts. The approach enables calibration-free synchronization and reliable multi-person grounding, with practical implications for human action recognition (HAR), rehabilitation, and motion analysis.

Abstract

We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
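The three-level contrastive strategy described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the symmetric InfoNCE form, the mean pooling over time and over sensors, the temperature value, the equal level weights, and the names `info_nce` and `hierarchical_loss` are all assumptions chosen for illustration, and the Masked Token Prediction objective is omitted.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE for row-aligned embeddings a, b of shape (N, D).

    Row i of `a` is the positive for row i of `b`; all other rows are negatives.
    (Assumed loss form; the paper's exact objective is not given in this excerpt.)
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def hierarchical_loss(imu_tok, pose_tok, weights=(1.0, 1.0, 1.0)):
    """Token-, local-, and global-level contrastive loss.

    imu_tok, pose_tok: arrays of shape (B, S, T, D) holding token embeddings
    for B clips, S sensors / body parts, T temporal tokens, D dimensions.
    Pooling by simple averaging is an assumption of this sketch.
    """
    B, S, T, D = imu_tok.shape
    # Token level: each (clip, sensor, time) token is paired with its counterpart.
    l_tok = info_nce(imu_tok.reshape(-1, D), pose_tok.reshape(-1, D))
    # Local level: pool over time, one embedding per sensor / body part.
    imu_loc, pose_loc = imu_tok.mean(axis=2), pose_tok.mean(axis=2)
    l_loc = info_nce(imu_loc.reshape(-1, D), pose_loc.reshape(-1, D))
    # Global level: pool over sensors, one embedding per clip.
    l_glob = info_nce(imu_loc.mean(axis=1), pose_loc.mean(axis=1))
    w_tok, w_loc, w_glob = weights
    return float(w_tok * l_tok + w_loc * l_loc + w_glob * l_glob)

rng = np.random.default_rng(0)
pose = rng.normal(size=(4, 3, 5, 16))
imu = pose + 0.01 * rng.normal(size=pose.shape)   # nearly aligned pairs
loss_aligned = hierarchical_loss(imu, pose)
loss_shuffled = hierarchical_loss(imu, pose[::-1])  # clips mismatched
```

In this toy setup the aligned pairing yields a lower loss than the mismatched one, which is the behavior a contrastive objective is meant to reward at every level.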

Paper Structure (17 sections, 7 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Proposed framework for motion binding between IMUs and 2D pose sequences from video. Contrastive learning is applied in both the local space, aligning each IMU with its corresponding body part, and the global space, aligning full-body representations. This representation supports several downstream tasks, including cross-modal retrieval, temporal synchronization, subject and body-part localization, and human action recognition.
  • Figure 2: Overview of the proposed MoBind. The framework first encodes each IMU stream together with the motion of its corresponding body part, yielding token-level and local-level representations per sensor. These local representations are then aggregated across sensors to form global-level embeddings. The contrastive objective applies at all three levels. In addition, a Masked Token Prediction (MTP) module is used only during training to preserve coarse semantic structure, preventing the model from over-focusing on fine-grained alignment.
  • Figure 3: IMU$\rightarrow$Video retrieval results on mRi (left) and EgoHumans (right). Each example shows the query IMU signal, its corresponding ground-truth video segment, and the top three retrieved video segments. Our method successfully retrieves the ground-truth segment, and the other top-ranked results are also visually similar to the ground truth, demonstrating robust cross-modal alignment.
  • Figure 4: Per-action synchronization accuracy on EgoHumans (left) and mRi (right). MoBind achieves sub-50 ms error on all EgoHumans actions and under 1 s on all mRi actions, despite the challenges posed by repetitive movements and near-duplicate segments. Results confirm MoBind's robustness across diverse motion types and environments.
  • Figure 5: Examples of body-part localization on EgoHumans. Each column shows the query IMU (top) and the predicted body part with the highest similarity score (bottom), demonstrating accurate identification of sensor placement.
  • ...and 2 more figures
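
The temporal synchronization task evaluated in Figure 4 can be pictured as an offset search in the shared embedding space. The sketch below is one plausible way to do this, not the procedure used in the paper: it assumes token embeddings for both modalities are already available, scores each candidate shift by the mean cosine similarity along the corresponding diagonal of the similarity matrix, and uses the hypothetical name `estimate_offset`.

```python
import numpy as np

def estimate_offset(imu_emb, pose_emb, max_shift):
    """Estimate the token shift s such that imu_emb[t] best matches pose_emb[t + s].

    imu_emb: (T_imu, D) and pose_emb: (T_pose, D) token embeddings in a shared
    space. Exhaustive search over shifts is an assumption of this sketch.
    Returns (best_shift, best_score).
    """
    imu = imu_emb / np.linalg.norm(imu_emb, axis=1, keepdims=True)
    pose = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    sim = imu @ pose.T  # sim[t, u] = cosine similarity of IMU token t, pose token u

    best_shift, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        diag = np.diagonal(sim, offset=s)  # entries sim[t, t + s]
        if diag.size == 0:
            continue  # no overlap at this shift
        score = diag.mean()
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift, float(best_score)

rng = np.random.default_rng(0)
pose_tokens = rng.normal(size=(40, 8))
# Synthetic IMU tokens that start 3 tokens into the pose sequence.
imu_tokens = pose_tokens[3:33] + 0.01 * rng.normal(size=(30, 8))
shift, score = estimate_offset(imu_tokens, pose_tokens, max_shift=10)
```

Converting the estimated shift to seconds requires multiplying by the temporal stride of one token, which depends on the tokenizer configuration and is not stated in this excerpt.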