OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery

Yiwen Zhao; Ce Zheng; Yufu Wang; Hsueh-Han Daniel Yang; Liting Wen; Laszlo A. Jeni

Abstract

Human mesh recovery (HMR) reconstructs the 3D human body from monocular videos, and recent works extend it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing: system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment with physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and on highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at https://tsukasane.github.io/Video-OnlineHMR/.

Paper Structure (26 sections, 24 equations, 14 figures, 8 tables)

Introduction
Related Works
Image and Video-based Human Mesh Recovery
Human-Camera-Scene Joint Modeling
Streaming Inference
Methodology
Parametric Human Model Preliminary
Camera Coordinates Online HMR
Human Centric Incremental SLAM
Frequency Domain Metric
Experiment
Datasets and Metrics
Results Analysis
Ablation Study
Conclusion
...and 11 more sections

Figures (14)

Figure 1: An in-the-wild example of our framework. Given a streaming monocular RGB video, our method leverages a two-branch inference to recover the world-grounded human motion in an online manner.
Figure 2: The online processing workflow of our method. Given a streaming video input, we estimate the world-coordinate 3D human body of the most recent frame. Here $\mathbf{T}_i$ is the homogeneous transformation matrix composed of $\mathbf{q}_i^{\text{c}}$ and $\mathbf{t}_i^{\text{c}}$ (see Sec. \ref{sec:method3.1}).
Figure 3: Sliding-window learning pipeline. The input sequence is sliced into overlapping windows; the model learns to fuse spatial and temporal information inside each window, and velocity regularization alleviates jitter.
Figure 4: Quantitative comparison of OnlineHMR and Human3R on the same EMDB-2 video with ground truth after world coordinate alignments.
Figure 5: Quantitative comparison of OnlineHMR and Human3R on the same EMDB-2 video with ground truth after world coordinate alignments.
...and 9 more figures
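
The captions of Figures 2 and 3 mention three operations that a small example may make more concrete: composing $\mathbf{T}_i$ from a rotation and a translation, slicing a sequence into overlapping windows, and penalizing frame-to-frame velocity. The following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes that $\mathbf{q}_i^{\text{c}}$ is a quaternion in (w, x, y, z) order and that $\mathbf{t}_i^{\text{c}}$ is a 3-vector. The function names, the window and stride parameters, and the mean-squared-displacement form of the velocity term are all hypothetical choices made for illustration.

```python
import math


def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The (w, x, y, z) ordering is an assumption; the quaternion is
    normalized first so slightly non-unit inputs remain valid rotations.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    if n == 0.0:
        raise ValueError("zero-norm quaternion")
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def homogeneous_transform(q, t):
    """Build the 4x4 matrix T = [[R, t], [0, 1]] from rotation q and translation t."""
    R = quat_to_rotmat(q)
    return [
        R[0] + [t[0]],
        R[1] + [t[1]],
        R[2] + [t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def sliding_windows(num_frames, window, stride):
    """Return (start, end) pairs, end exclusive; windows overlap when stride < window."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def velocity_penalty(positions):
    """Mean squared frame-to-frame displacement of a sequence of 3D points.

    A generic smoothness term used here for illustration only; the
    regularizer used in the paper may take a different form.
    """
    if len(positions) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(positions, positions[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(positions) - 1)
```

For example, `sliding_windows(10, 4, 2)` yields four windows that each share two frames with their neighbor, and `homogeneous_transform((1, 0, 0, 0), (1, 2, 3))` yields an identity rotation with the translation in the last column.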
