3D Ground Truth Reconstruction from Multi-Camera Annotations Using UKF
Linh Van Ma, Unse Fatima, Tepy Sokun Chriv, Haroon Imran, Moongu Jeon
TL;DR
This work tackles the challenge of producing accurate 3D ground truth annotations from 2D multi-view annotations by fusing observations from calibrated cameras with an Unscented Kalman Filter. It introduces a multi-view single-object tracking framework that represents objects as 3D ellipsoids and uses a homography-based projection to connect the 3D state with 2D bounding boxes and keypoints across views. The main contributions are the UKF-based fusion, a principled initialization strategy, and demonstrated high accuracy across multiple datasets (Wildtrack, MultiviewX, CMC, Panoptic) for both 3D localization and pose estimation. This approach enables scalable, automatic generation of rich 3D ground truth without dense 3D detections or manual labeling, with practical impact for autonomous systems and vision research. The method shows strong performance under occlusion and viewpoint changes, though it relies on overlapping fields of view and precise camera calibration.
Abstract
Accurate 3D ground truth estimation is critical for applications such as autonomous navigation, surveillance, and robotics. This paper introduces a novel method that uses an Unscented Kalman Filter (UKF) to fuse 2D ground truth annotations, either bounding boxes or pose keypoints, from multiple calibrated cameras into accurate 3D ground truth. Starting from human-annotated 2D ground truth, the proposed method, a multi-camera single-object tracking algorithm, transforms 2D image coordinates into robust 3D world coordinates through homography-based projection and UKF-based fusion. The algorithm processes multi-view data to estimate object positions and shapes while handling challenges such as occlusion. We evaluate our method on the CMC, Wildtrack, and Panoptic datasets, demonstrating high accuracy in 3D localization compared to the available 3D ground truth. Unlike existing approaches that provide only ground-plane information, our method also outputs the full 3D shape of each object. Additionally, the algorithm offers a scalable and fully automatic solution for multi-camera systems using only 2D image annotations.
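
To illustrate the kind of fusion the abstract describes, the following is a minimal sketch of a single UKF measurement update that combines 2D observations from two calibrated cameras into one 3D estimate. It is not the paper's implementation: the state here is a bare 3D point rather than a 3D ellipsoid, the measurement model is a plain pinhole projection with hypothetical 3x4 camera matrices rather than the homography-based projection, and all names (`sigma_points`, `project`, `ukf_update`) and numeric values are assumptions chosen for illustration.

```python
import numpy as np

def sigma_points(mean, cov, kappa=0.0):
    """Symmetric sigma points and weights for the unscented transform."""
    n = mean.size
    L = np.linalg.cholesky((n + kappa) * cov)  # columns of L are the offsets
    pts = np.vstack([mean, mean + L.T, mean - L.T])
    w = np.full(2 * n + 1, 0.5 / (n + kappa))
    w[0] = kappa / (n + kappa)
    return pts, w

def project(P, X):
    """Pinhole projection of a 3D point X with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def ukf_update(mean, cov, z, h, R):
    """One UKF measurement update for measurement function h and noise R."""
    pts, w = sigma_points(mean, cov)
    Z = np.array([h(p) for p in pts])      # sigma points pushed through h
    z_hat = w @ Z                          # predicted measurement
    dZ, dX = Z - z_hat, pts - mean
    S = (w[:, None] * dZ).T @ dZ + R       # innovation covariance
    C = (w[:, None] * dX).T @ dZ           # state-measurement cross covariance
    K = C @ np.linalg.inv(S)               # Kalman gain
    return mean + K @ (z - z_hat), cov - K @ S @ K.T

# Hypothetical two-camera rig: shared intrinsics, second camera shifted 2 m in x.
K_intr = np.array([[1000.0, 0.0, 640.0],
                   [0.0, 1000.0, 360.0],
                   [0.0, 0.0, 1.0]])
P1 = K_intr @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K_intr @ np.hstack([np.eye(3), np.array([[-2.0], [0.0], [0.0]])])
cams = [P1, P2]

def h(X):
    """Stack the 2D projections of X in every camera into one vector."""
    return np.concatenate([project(P, X) for P in cams])

truth = np.array([0.5, 0.2, 5.0])          # assumed true 3D position
z = h(truth)                               # noiseless 2D "annotations"
mean0, cov0 = np.array([0.0, 0.0, 4.0]), np.eye(3)
mean1, cov1 = ukf_update(mean0, cov0, z, h, np.eye(4))

print("error before:", np.linalg.norm(mean0 - truth))
print("error after: ", np.linalg.norm(mean1 - truth))
```

Stacking the per-camera projections into one measurement vector is one simple way to fuse views in a single update. A camera in which the object is occluded could be left out of `cams` for that frame, which is one plausible way to handle missing views under these assumptions.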
