Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery

Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Tze Ho Elden Tse, Angela Yao

TL;DR

This work tackles the persistent problem of unknown scale in monocular SLAM for world-coordinate human mesh recovery. It introduces HAC, an optimization-free framework that uses the absolute depth of human joints predicted by a human mesh recovery (HMR) model as calibration references to recover global camera and human motion directly in world coordinates. By linking the HMR-predicted depth of human-ground contacts to SLAM’s relative scene depth, and falling back to an estimated ground plane when the contact joint is out of view, HAC achieves state-of-the-art global motion accuracy while running roughly 100x faster than optimization-based methods. The approach performs robustly across diverse datasets and SLAM/HMR backbones, which makes fast global human motion estimation practical at scale for challenging videos.

Abstract

Accurate camera motion estimation is essential for recovering global human motion in world coordinates from RGB video inputs. SLAM is widely used for estimating the camera trajectory and a scene point cloud, but monocular SLAM does so only up to an unknown scale factor. Previous works estimate the scale factor through optimization, but this is unreliable and time-consuming. This paper presents an optimization-free scale calibration framework, Human as Checkerboard (HAC). HAC uses the human body predicted by a human mesh recovery (HMR) model as a calibration reference. Specifically, it uses the absolute depth of human-scene contact joints as references to calibrate the corresponding relative scene depth from SLAM. HAC benefits from the geometric priors encoded in HMR models to estimate the SLAM scale and achieves precise global human motion estimation. Simple yet powerful, our method sets a new state of the art for global human mesh estimation, reducing motion errors by 50% over prior local-to-global methods while using 100$\times$ less inference time than optimization-based methods. Project page: https://martayang.github.io/HAC.
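The calibration idea in the abstract can be illustrated with a minimal sketch. It assumes that each usable frame gives one pair of depths for the same contact point: a metric depth from HMR and a relative depth from SLAM. The function names, the sample values, and the use of a median to combine per-frame ratios are illustrative assumptions, not the paper's stated formulation.

```python
from statistics import median


def estimate_scale(hmr_depths_m, slam_depths):
    """Estimate one SLAM-to-metric scale factor for a video.

    hmr_depths_m: metric depths (meters) of contact joints from HMR.
    slam_depths:  depths of the matching scene points in SLAM units.
    Assumption: per-frame ratios are combined with a median so that a few
    bad frames do not dominate; the paper may aggregate differently.
    """
    ratios = [
        d_hmr / d_slam
        for d_hmr, d_slam in zip(hmr_depths_m, slam_depths)
        if d_hmr > 0 and d_slam > 0  # skip invalid or missing depths
    ]
    if not ratios:
        raise ValueError("no valid depth pairs to calibrate from")
    return median(ratios)


def rescale_trajectory(camera_positions, scale):
    """Convert SLAM camera positions (arbitrary units) to metric units."""
    return [tuple(scale * c for c in pos) for pos in camera_positions]


# Hypothetical depths for four frames; the third frame is an outlier.
hmr_depths_m = [4.0, 5.0, 3.0, 6.0]
slam_depths = [2.0, 2.5, 1.0, 3.0]

scale = estimate_scale(hmr_depths_m, slam_depths)
metric_positions = rescale_trajectory([(0.0, 0.0, 0.0), (0.5, 0.0, 1.0)], scale)
```

Because the scale comes from a closed-form ratio rather than an iterative fit, the cost is a single pass over the frames, which is consistent with the optimization-free framing above.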

Paper Structure (26 sections, 7 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: (a) A video sequence entangles camera motion and human motion in world coordinates. (b) and (c) Local-to-global methods like WHAM [WHAM_2023] are time-efficient but fail in ambiguous cases; optimization-based methods like SLAHMR [SLAMHR_2023] are time-consuming and struggle to recover a good trajectory. In contrast, our method achieves accurate trajectories without optimization.
  • Figure 2: Box plot of the error rates of the HMR-predicted metric depth and of the global human translation estimated by WHAM [WHAM_2023]. The HMR depth consistently exhibits a significantly lower error rate than the global translation, which validates our concept of using humans as checkerboards for scale calibration.
  • Figure 3: Overall pipeline of HAC. Given a monocular input video, we use SLAM to estimate camera motion and a scene reconstruction at an arbitrary scale. Concurrently, we predict the local human mesh using an HMR model. We then use the metric depth of contact joints from HMR to calibrate the SLAM scale. With this approach, we can accurately decouple global human and camera motion in world coordinates.
  • Figure 4: Detail of our scale calibration process. The scale is calibrated by comparing the depth from the camera $\mathbf{o}$ to the contact joint $\mathbf{J}^{w}_{p}$ predicted by HMR with the depth to the corresponding reference point $p$ in the scene point cloud. There are two possible scenarios for the reference point $p$: (a) when the contact joint is visible, $p$ is obtained as the intersection with the point cloud; (b) when the contact joint is not visible (e.g., no intersection or incorrect intersection), $p$ is determined as the intersection with the estimated ground plane.
  • Figure 5: Comparison of global human trajectory estimation on EMDB. Overall, our method aligns more closely with the ground truth than WHAM and TRAM.
  • ...and 6 more figures
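The two scenarios described in the Figure 4 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes the reference point is found by casting a ray from the camera center through the contact joint, that a point-cloud "intersection" means the nearest cloud point lying within a small distance of that ray, and that the ground plane is given as `normal · x + offset = 0`. All function names and the `max_offset` threshold are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return tuple(c / norm for c in v)


def ray_plane_distance(origin, direction, normal, offset):
    """Distance along the ray to the plane normal . x + offset = 0, or None."""
    denom = dot(normal, direction)
    if abs(denom) < 1e-9:  # ray is parallel to the plane
        return None
    t = -(dot(normal, origin) + offset) / denom
    return t if t > 0 else None  # ignore hits behind the camera


def ray_cloud_distance(origin, direction, points, max_offset):
    """Distance along the ray to the first cloud point near the ray, or None."""
    best = None
    for p in points:
        rel = tuple(pi - oi for pi, oi in zip(p, origin))
        t = dot(rel, direction)  # projection of the point onto the ray
        if t <= 0:
            continue
        perp_sq = dot(rel, rel) - t * t  # squared distance from the ray
        if perp_sq <= max_offset ** 2 and (best is None or t < best):
            best = t
    return best


def reference_distance(origin, joint_dir, points, plane, joint_visible,
                       max_offset=0.05):
    """Distance to the reference point p, in SLAM units.

    Case (a): contact joint visible, use the point cloud.
    Case (b): joint not visible or no cloud hit, use the ground plane.
    """
    direction = normalize(joint_dir)
    if joint_visible:
        t = ray_cloud_distance(origin, direction, points, max_offset)
        if t is not None:
            return t
    normal, offset = plane
    return ray_plane_distance(origin, direction, normal, offset)


origin = (0.0, 0.0, 0.0)
joint_dir = (0.0, -1.0, 2.0)         # hypothetical ray toward a foot joint
plane = ((0.0, 1.0, 0.0), 1.0)       # hypothetical ground plane y = -1
cloud = [(0.0, -0.5, 1.0), (3.0, 0.0, 5.0)]

visible_t = reference_distance(origin, joint_dir, cloud, plane, True)
fallback_t = reference_distance(origin, joint_dir, [], plane, False)
```

Dividing the HMR metric depth of the contact joint by the distance returned here would give a per-frame scale estimate under these assumptions.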