CasCalib: Cascaded Calibration for Motion Capture from Sparse Unsynchronized Cameras

James Tang, Shashwat Suri, Daniel Ajisafe, Bastian Wandt, Helge Rhodin

TL;DR

This work tackles the problem of reconstructing accurate 3D human motion from sparse, unsynchronized camera views by automatically calibrating intrinsics, extrinsics, and temporal offsets. It introduces a cascaded framework that partitions the high-dimensional calibration problem into sequential stages, solving for $N(4\times 6+1)$ parameters across cameras and refining with ICP and bundle adjustment. Key contributions include the cascade decomposition with tailored objective functions, an end-to-end pipeline that uses 2D keypoints as the sole input, and open-source code with hyperparameters for reproducibility. The approach enables practical multi-view motion capture with consumer-grade cameras, offering an automated alternative to marker-based calibration and hardware synchronization in diverse settings.

Abstract

It is now possible to estimate 3D human pose from monocular images with off-the-shelf 3D pose estimators. However, many practical applications require fine-grained absolute pose information for which multi-view cues and camera calibration are necessary. Such multi-view recordings are laborious because they require manual calibration, and are expensive when using dedicated hardware. Our goal is full automation, which includes temporal synchronization, as well as intrinsic and extrinsic camera calibration. This is done by using persons in the scene as the calibration objects. Existing methods either address only synchronization or calibration, assume one of the former as input, or have significant limitations. A common limitation is that they only consider single persons, which eases correspondence finding. We attain this generality by partitioning the high-dimensional time and calibration space into a cascade of subspaces and introduce tailored algorithms to optimize each efficiently and robustly. The outcome is an easy-to-use, flexible, and robust motion capture toolbox that we release to enable scientific applications, which we demonstrate on diverse multi-view benchmarks. Project website: https://github.com/jamestang1998/CasCalib.

Paper Structure (35 sections, 19 equations, 18 figures, 11 tables)

Figures (18)

  • Figure 1: Cascaded calibration overview. From top to bottom, we show how we break up the optimization problem into smaller subproblems by solving for a subset of the parameters at a time, with subsequent steps refining the earlier ones. The first step is the Single View Calibration step where we estimate the normal vector $\mathbf{n}$ and the intrinsics $\mathbf{K}$. Then, we estimate the time synchronization offset $\Delta t$. Finally, with the last three steps, we estimate and refine the rotation matrix $\mathbf{R}$ and the translation $\mathbf{T}$.
  • Figure 2: System overview. A fine-grained view of the five stages in Figure 1, including how detections of single persons in single views are treated independently in Stage I and jointly in subsequent stages. Variables are in plate notation, with $n$ the number of cameras and $m$ the number of people in the scene.
  • Figure 3: 2D Reconstruction. Visual results for the single view calibration for Human3.6M Subject 1. The blue grid represents the ground plane predicted by our method with a coordinate axis defined at the bottom of the image. The green line connects the ankle and shoulder keypoints.
  • Figure 4: Time synchronization. Time synchronization results between the ref (red) and sync (blue) sequences for the Subject 1 walking sequence in Human3.6M.
  • Figure 5: Ground plane view. Bird's-eye view of the ankles on the Terrace sequence, with inliers in green and outliers in red.
  • ...and 13 more figures
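
The time synchronization stage named in the Figure 1 and Figure 4 captions estimates an offset $\Delta t$ between a reference (ref) and a to-be-synchronized (sync) sequence. As a rough illustration of what estimating such an offset can look like, the sketch below searches integer frame shifts and keeps the one with the smallest mean squared difference. It is not the paper's objective function, which is not given in this section. The function name `estimate_offset`, the parameter `max_shift`, and the use of a single 1D per-frame signal (for example, a hypothetical ankle-height trajectory) are all assumptions made for this example.

```python
import numpy as np


def estimate_offset(ref, sync, max_shift):
    """Return the integer shift d (in frames) that minimizes the mean squared
    difference between ref[t] and sync[t + d] over their overlapping frames.

    Hypothetical helper for illustration only; not taken from the CasCalib code.
    """
    ref = np.asarray(ref, dtype=float)
    sync = np.asarray(sync, dtype=float)
    best_shift, best_cost = 0, np.inf
    for d in range(-max_shift, max_shift + 1):
        # Frames t for which both ref[t] and sync[t + d] exist.
        start = max(0, -d)
        stop = min(len(ref), len(sync) - d)
        if stop - start < 2:
            continue  # too little overlap to compare
        diff = ref[start:stop] - sync[start + d:stop + d]
        cost = np.mean(diff ** 2)
        if cost < best_cost:
            best_shift, best_cost = d, cost
    return best_shift


def toy_signal(t):
    # Synthetic, non-repeating (within the search range) 1D motion signal.
    return np.sin(2 * np.pi * t / 50.0) + 0.3 * np.sin(2 * np.pi * t / 17.0)


frames = np.arange(200)
ref_signal = toy_signal(frames)
sync_signal = toy_signal(frames - 7)  # same motion, recorded 7 frames late
offset = estimate_offset(ref_signal, sync_signal, max_shift=20)
print(offset)
```

An exhaustive search over shifts is chosen here only because it is easy to verify. With noisy 2D keypoints, a robust cost and sub-frame offsets would likely be needed, and this sketch covers neither.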