Spatial-Related Sensors Matters: 3D Human Motion Reconstruction Assisted with Textual Semantics

Xueyuan Yang; Chao Yao; Xiaojuan Ban

TL;DR

The paper addresses 3D human pose reconstruction from sparse IMUs, an under-constrained problem in which different poses can produce near-identical sensor readings. It introduces a multimodal pipeline that fuses sensor data with textual semantics, built from Uncertainty-guided Spatial Attention (UGSA), a Hierarchical Temporal Transformer (HTT), and text-sensor contrastive learning to align the two modalities. Key contributions include uncertainty-based sensor resampling, spatial modeling of inter-sensor relationships that accounts for sensor reliability, and cross-modal temporal alignment that resolves ambiguities such as sitting versus standing, yielding more natural motion. Experiments on TotalCapture, DIP-IMU, and BABEL-annotated data report state-of-the-art pose accuracy and robust performance in both offline and real-time settings, highlighting the practical impact for wearable motion capture.
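To make the idea of reliability-aware sensor weighting concrete, here is a minimal sketch. It is not the paper's UGSA module: it assumes each IMU comes with a predicted log-variance and turns those into weights with a softmax over the negated values, which amounts to normalized inverse-variance weighting. The function names, the six-sensor setup, and the toy features are all hypothetical.

```python
import math

def uncertainty_weights(log_variances):
    """Map per-sensor log-variances to weights that sum to 1.

    Assumption: lower predicted variance means a more reliable sensor,
    so weights are a softmax over the negated log-variances.
    """
    scores = [-lv for lv in log_variances]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weight_features(features, log_variances):
    """Scale each sensor's feature vector by its reliability weight."""
    weights = uncertainty_weights(log_variances)
    return [[w * x for x in feat] for w, feat in zip(weights, features)]

# Six hypothetical IMUs with 2-D toy features; the last one is noisy.
features = [[1.0, 0.5]] * 6
log_variances = [0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
weights = uncertainty_weights(log_variances)
weighted = weight_features(features, log_variances)
```

In this toy setup the noisy sensor receives a smaller weight than the others, so its features contribute less to whatever is computed downstream.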

Abstract

Leveraging wearable devices for motion reconstruction has emerged as an economical and viable technique. Certain methodologies place sparse Inertial Measurement Units (IMUs) on the human body and use data-driven strategies to model human poses. However, reconstructing motion from sparse IMU data alone is inherently ambiguous, because many identical IMU readings correspond to different poses. In this paper, we explore the spatial importance of multiple sensors, supervised by text that describes specific actions. Specifically, uncertainty is introduced to derive weighted features for each IMU. We also design a Hierarchical Temporal Transformer (HTT) and apply contrastive learning to achieve precise temporal and feature alignment of sensor data with textual semantics. Experimental results demonstrate that our proposed approach achieves significant improvements in multiple metrics compared to existing methods. Notably, with textual supervision, our method not only differentiates between ambiguous actions such as sitting and standing but also produces more precise and natural motion.
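The abstract does not spell out the contrastive objective, so the following is only a sketch under assumptions: a symmetric InfoNCE loss in the style commonly used for cross-modal alignment, where the i-th sensor embedding and the i-th text embedding form the positive pair. The function names, the temperature of 0.07, and the toy 2-D embeddings are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the target for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # log-sum-exp with max subtraction for stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        loss += log_sum - row[i]
    return loss / len(logits)

def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over cosine similarities of paired embeddings."""
    s = [l2_normalize(v) for v in sensor_emb]
    t = [l2_normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-sensor direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

# Toy batch of two clips: matched pairs should score a lower loss.
sensor = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = contrastive_loss(sensor, aligned_text)
loss_swapped = contrastive_loss(sensor, swapped_text)
```

The loss is lower when each sensor embedding is closest to its own text description, which is the property that lets a text label such as "sitting" pull apart sensor sequences that look alike.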

Paper Structure (17 sections, 8 equations, 7 figures, 3 tables)

Introduction
Related Work
Sensor-based Human Motion Reconstruction
Textual Semantics in Human Motion Field
Method
Text Encoder
Sensor Encoder
Text-Sensor Fusion Module
Losses
Experiment
Dataset Setting
Metric
Training Details
Comparisons
Ablation
...and 2 more sections

Figures (7)

Figure 1: For certain postures, such as standing and sitting, the rotation and acceleration readings output by the sensors are nearly indistinguishable. Incorporating additional information such as text can help to address this challenge.
Figure 2: Overview of our method. Our model encapsulates three distinct encoders: a Text Encoder, a Sensor Encoder, and a Text-Sensor Fusion Module. The details of the Sensor Encoder and the Hierarchical Temporal Transformer module are illustrated on the right. The schematic of the model output is adapted from BABEL:CVPR:2021.
Figure 3: An illustration of window self-attention (left) and shifted window self-attention (right).
Figure 4: An efficient methodology for batch computation of self-attention within the context of shifted window partitioning.
Figure 5: Mesh error distribution and qualitative comparisons between our method (with/without text) and Transpose. The text description of the motion is provided below, with the sequence label illustrated in green and the frame label presented in blue.
...and 2 more figures
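The window and shifted-window partitioning named in the Figure 3 and Figure 4 captions can be sketched on a 1-D sequence of frame indices. This is an illustration under assumptions, not the paper's implementation: it uses a cyclic roll before partitioning, a common way to batch shifted windows, and the function names, window size, and shift are hypothetical.

```python
def window_partition(frames, window_size):
    """Split a sequence into consecutive non-overlapping windows."""
    if len(frames) % window_size != 0:
        raise ValueError("sequence length must be a multiple of window_size")
    return [frames[i:i + window_size]
            for i in range(0, len(frames), window_size)]

def shifted_window_partition(frames, window_size, shift):
    """Cyclically roll the sequence left by `shift`, then partition it."""
    shift %= len(frames)
    rolled = frames[shift:] + frames[:shift]
    return window_partition(rolled, window_size)

frames = list(range(8))  # eight toy frame indices
regular = window_partition(frames, 4)
shifted = shifted_window_partition(frames, 4, 2)
```

With these inputs, `regular` groups frames 0-3 and 4-7, while `shifted` groups frames 2-5 and places 6, 7, 0, 1 in the wrapped window. Frames 7 and 0 are not temporal neighbors, so a batched implementation would typically mask attention between the two halves of that wrapped window.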
