OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery

Yiwen Zhao, Ce Zheng, Yufu Wang, Hsueh-Han Daniel Yang, Liting Wen, Laszlo A. Jeni

Abstract

Human mesh recovery (HMR) models the 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing: system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at https://tsukasane.github.io/Video-OnlineHMR/.

Paper Structure (26 sections, 24 equations, 14 figures, 8 tables)


Figures (14)

  • Figure 1: An in-the-wild example of our framework. Given a streaming monocular RGB video, our method leverages a two-branch inference to recover the world-grounded human motion in an online manner.
  • Figure 2: The online processing workflow of our method. Given a streaming video input, we estimate the world-coordinate 3D human body for the most recent frame. Here $\mathbf{T}_i$ is the homogeneous transformation matrix composed of $\mathbf{q}_i^{\text{c}}$ and $\mathbf{t}_i^{\text{c}}$ in Sec. \ref{sec:method3.1}.
  • Figure 3: Sliding-window learning pipeline. The input sequence is sliced into overlapping windows; the model learns to fuse spatial and temporal information within each window, and velocity regularization alleviates jitter.
  • Figure 4: Quantitative comparison of OnlineHMR and Human3R on the same EMDB-2 video with ground truth after world coordinate alignments.
  • Figure 5: Quantitative comparison of OnlineHMR and Human3R on the same EMDB-2 video with ground truth after world coordinate alignments.
  • ...and 9 more figures
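
The caption of Figure 2 describes $\mathbf{T}_i$ as a homogeneous transformation matrix built from $\mathbf{q}_i^{\text{c}}$ and $\mathbf{t}_i^{\text{c}}$. The sketch below shows one common way to assemble such a matrix. It assumes that $\mathbf{q}_i^{\text{c}}$ is a rotation quaternion in (w, x, y, z) order and that $\mathbf{t}_i^{\text{c}}$ is a 3D translation. The function names, the component ordering, and the direction of the transform (camera-to-world or world-to-camera) are illustrative assumptions, not details confirmed by the paper.

```python
import math


def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The (w, x, y, z) ordering is an assumption made for this sketch.
    The quaternion is normalized first so the result is a proper rotation.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    if n == 0.0:
        raise ValueError("zero-norm quaternion")
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def compose_transform(q, t):
    """Build the 4x4 homogeneous matrix [[R, t], [0, 1]] (hypothetical helper)."""
    R = quat_to_rotation(q)
    return [
        R[0] + [t[0]],
        R[1] + [t[1]],
        R[2] + [t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def apply_transform(T, p):
    """Apply a 4x4 homogeneous transform to a 3D point: R @ p + t."""
    x, y, z = p
    return [row[0] * x + row[1] * y + row[2] * z + row[3] for row in T[:3]]


# Example: a 90-degree rotation about the z-axis followed by a translation.
q_example = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
t_example = (1.0, 2.0, 3.0)
T_example = compose_transform(q_example, t_example)
moved = apply_transform(T_example, (1.0, 0.0, 0.0))
```

In this example the point (1, 0, 0) is first rotated onto the y-axis and then shifted by the translation, so `moved` is approximately (1, 3, 3).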