Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras

Charles Javerliat, Pierre Raimbaud, Guillaume Lavoué

TL;DR

Kineo tackles calibration-free, markerless multi-view motion capture with unsynchronized consumer RGB cameras. It combines a 2D-keypoint–driven SfM-style pipeline with a graph-based, distortion-aware calibration strategy that estimates intrinsics, extrinsics, and metric scale without manual setup, while producing 3D keypoints and dense scene maps at real-world scale. The approach introduces confidence-driven keypoint sampling, minimum-spanning-tree calibration, a novel 3D confidence score, and two scale-recovery strategies (SMPL-based and metric-depth). It achieves state-of-the-art performance among calibration-free methods on EgoHumans and Human3.6M, with substantial reductions in camera translation error, camera rotation error, and W-MPJPE. By prioritizing modular, detector-agnostic components and efficient computation, Kineo remains practical for long sequences and, in specific configurations, processes footage faster than its duration; the pipeline is released as open source to promote adoption. Overall, the paper delivers a robust, scalable, and accessible framework that closes much of the gap between calibration-free and calibrated motion capture while enabling real-world deployment for both human and non-human subjects.

Abstract

Markerless multi-view motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint position error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configurations (e.g., 36 min to process 1 h 20 min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
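
For readers unfamiliar with the Brown-Conrady distortion model named in the abstract, the following is a minimal sketch of a pinhole projection with radial and tangential distortion, written in the widely used convention with radial coefficients k1, k2, k3 and tangential coefficients p1, p2. Which coefficients Kineo actually estimates, and in what parameterization, is not stated in this summary; the function names and the numeric values below are illustrative assumptions only.

```python
def brown_conrady_distort(x, y, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Distort normalized image coordinates (x, y) with the Brown-Conrady model.

    k1..k3 are radial coefficients, p1 and p2 are tangential coefficients.
    This coefficient set is an assumption, not Kineo's documented choice.
    """
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


def project(point_cam, fx, fy, cx, cy, **dist):
    """Project a 3D point given in camera coordinates to pixel coordinates."""
    X, Y, Z = point_cam
    x, y = X / Z, Y / Z                      # perspective division
    x_d, y_d = brown_conrady_distort(x, y, **dist)
    return fx * x_d + cx, fy * y_d + cy      # apply focal lengths and principal point


# Hypothetical intrinsics, chosen only for illustration.
point = (0.5, -0.25, 2.0)
u0, v0 = project(point, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
u1, v1 = project(point, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0, k1=-0.1)
print("undistorted pixel:", (u0, v0))
print("with k1 = -0.1:   ", (u1, v1))
```

A negative k1 pulls the projected point toward the principal point (barrel distortion), which is why ignoring distortion on consumer-grade lenses biases reprojection-based calibration.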

Paper Structure

This paper contains 40 sections, 30 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Overview of the Kineo pipeline for markerless motion capture from uncalibrated and unsynchronized multi-camera videos. Starting from raw video inputs, the system first performs audio-based temporal synchronization to align unsynchronized streams on a shared timeline. Next, automatic camera calibration estimates extrinsic and intrinsic parameters, including Brown–Conrady lens distortion, via a graph-based optimization over 2D keypoint correspondences selected by a confidence-driven sampling strategy. Using the recovered cameras, 3D keypoints and scene point maps are reconstructed, and each triangulated keypoint is assigned a pairwise reprojection confidence score to quantify reconstruction quality. Finally, metric-scale recovery is achieved either through a human body prior using the SMPL model or a monocular metric depth estimator for subject-agnostic scaling. The modular design enables robust and scalable reconstruction of both human and non-human subjects across long, multi-view sequences.
  • Figure 2: Example confidence-driven keypoint subsampling for a pair of views from Human3.6M. Colors indicate the pair confidence score $w_{kl}$, solid lines represent selected correspondences, and dashed lines represent correspondences that were not selected. The sampling favors keypoints for which the 2D keypoint detector is highly confident about their location (e.g., unoccluded and well-localized keypoints).
  • Figure 3: Example of camera loop closure with incorrect (left) and correct (right) relative scale factors. When the relative scales $\lambda_{ij}$ are incorrect, traversing the cycle $C_i \!\to\! C_j \!\to\! C_k \!\to\! C_i'$ does not return to the original camera position, resulting in $C_i \neq C_i'$. In contrast, when the correct scales are used, the composition of the relative transformations yields the identity, and the loop closes consistently with $C_i = C_i'$.
  • Figure 4: Comparison of scene pointmaps produced by two geometry estimation models. (a) shows the pointmap generated with MoGe, which uses single-view images and camera intrinsics to produce metric pointmaps. (b) shows the pointmap generated with VGGT, which leverages multiple views to predict a globally consistent pointmap and camera poses, aligned to our camera system. VGGT provides more spatially consistent pointmaps across views, while MoGe relies on single-view depth estimation.
  • Figure 5: Examples of the output formats obtained with different 2D keypoint detectors used by Kineo. (a–c) show keypoint-based representations in various formats, while (d) illustrates a full parametric SMPL body reconstruction. For the SMPL model, NLF surface keypoints are first detected, and the SMPL model is then fitted to them using the SMPLFitter module from NLF.
  • ...and 8 more figures
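
The loop-closure property described in the Figure 3 caption can be illustrated with a small sketch. Two-view geometry recovers each relative translation only up to scale, so every edge of the camera graph carries a unit direction and an unknown scale factor $\lambda_{ij}$. Composing the relative transforms around the cycle $C_i \to C_j \to C_k \to C_i'$ returns to the starting camera only when the scales are mutually consistent. The code below assumes three hypothetical world-to-camera poses and measures how far $C_i'$ drifts from $C_i$; it illustrates the consistency check only and is not Kineo's scale-estimation procedure, which this summary does not describe.

```python
import math


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose(A):
    return [list(row) for row in zip(*A)]


def relative_pose(pose_i, pose_j):
    """Return (R_ij, t_ij) such that x_j = R_ij x_i + t_ij, from world-to-camera poses."""
    (R_i, t_i), (R_j, t_j) = pose_i, pose_j
    R_ij = matmul(R_j, transpose(R_i))
    Rt = matvec(R_ij, t_i)
    return R_ij, [t_j[a] - Rt[a] for a in range(3)]


def split_scale(t):
    """Split a translation into (unit direction, scale), as two-view geometry would."""
    norm = math.sqrt(sum(c * c for c in t))
    return [c / norm for c in t], norm


def compose(first, second):
    """Apply `first`, then `second`: x -> R2 (R1 x + t1) + t2."""
    (R1, t1), (R2, t2) = first, second
    Rt = matvec(R2, t1)
    return matmul(R2, R1), [Rt[a] + t2[a] for a in range(3)]


def loop_closure_error(edges, scales):
    """Distance between C_i and C_i' after traversing the cycle with the given scales."""
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    t = [0.0, 0.0, 0.0]
    for (R_e, d_e), lam in zip(edges, scales):
        R, t = compose((R, t), (R_e, [lam * c for c in d_e]))
    return math.sqrt(sum(c * c for c in t))


# Three hypothetical world-to-camera poses (rotation, translation).
poses = [
    (rot_z(0.0), [0.0, 0.0, 0.0]),
    (rot_z(0.7), [1.0, 0.2, 0.0]),
    (rot_z(-0.4), [-0.5, 2.0, 0.3]),
]

edges, true_scales = [], []
for i, j in [(0, 1), (1, 2), (2, 0)]:
    R_ij, t_ij = relative_pose(poses[i], poses[j])
    d_ij, lam_ij = split_scale(t_ij)
    edges.append((R_ij, d_ij))
    true_scales.append(lam_ij)

closed = loop_closure_error(edges, true_scales)
wrong_scales = [2.0 * true_scales[0]] + true_scales[1:]   # corrupt one edge
drift = loop_closure_error(edges, wrong_scales)
print("error with consistent scales:", closed)
print("error with one wrong scale:  ", drift)
```

With consistent scales the error is zero up to floating-point rounding; doubling the scale of a single edge leaves a residual equal to that edge's true length, which is the $C_i \neq C_i'$ case in the left panel of the figure.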