3D Ground Truth Reconstruction from Multi-Camera Annotations Using UKF

Linh Van Ma, Unse Fatima, Tepy Sokun Chriv, Haroon Imran, Moongu Jeon

TL;DR

This work tackles the challenge of producing accurate 3D ground truth annotations from 2D multi-view annotations by fusing observations from calibrated cameras with an Unscented Kalman Filter. It introduces a multi-view single-object tracking framework that represents objects as 3D ellipsoids and uses a homography-based projection to connect the 3D state with 2D bounding boxes and keypoints across views. The main contributions are the UKF-based fusion, a principled initialization strategy, and demonstrated high accuracy across multiple datasets (Wildtrack, MultiviewX, CMC, Panoptic) for both 3D localization and pose estimation. This approach enables scalable, automatic generation of rich 3D ground truth without dense 3D detections or manual labeling, with practical impact for autonomous systems and vision research. The method shows strong performance under occlusion and viewpoint changes, though it relies on overlapping fields of view and precise camera calibration.

Abstract

Accurate 3D ground truth estimation is critical for applications such as autonomous navigation, surveillance, and robotics. This paper introduces a novel method that uses an Unscented Kalman Filter (UKF) to fuse 2D bounding box or pose keypoint ground truth annotations from multiple calibrated cameras into accurate 3D ground truth. Leveraging human-annotated 2D ground truth, our method, a multi-camera single-object tracking algorithm, transforms 2D image coordinates into robust 3D world coordinates through homography-based projection and UKF-based fusion. The algorithm processes multi-view data to estimate object positions and shapes while handling challenges such as occlusion. We evaluate our method on the CMC, Wildtrack, and Panoptic datasets, demonstrating high accuracy in 3D localization compared to the available 3D ground truth. Unlike existing approaches that provide only ground-plane information, our method also outputs the full 3D shape of each object. Additionally, the algorithm offers a scalable and fully automatic solution for multi-camera systems using only 2D image annotations.

Paper Structure

This paper contains 7 sections, 4 equations, 4 figures, 2 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Illustration of the 3D annotation produced by our method. Our output is a 3D ellipsoid representing the height, width, and length of an object. In contrast, the large red point on the ground plane is the output of previous methods such as Wildtrack chavdarova2018wildtrack and MultiviewX hou2020multiview.
  • Figure 2: Single-object tracking using an Unscented Kalman Filter (UKF) to estimate 3D location, shape, and keypoints from measurements (bounding boxes and keypoints) obtained from multiple camera images. We assume a target is born at time step $0$; it survives and evolves from time step $k$ to $k+1$.
  • Figure 3: We represent an object as an ellipsoid and employ the unscented transform to generate $2d + 1$ ellipsoids (van2004sigma, julier2004unscented), capturing the object's uncertainty in 3D space, where $d$ denotes the dimension of the object state. Each ellipsoid is then projected onto a 2D image as a bounding box using a camera matrix.
  • Figure 4: Illustration of an annotation error for object ID 358 in the Wildtrack dataset chavdarova2018wildtrack. The object disappears at frame 945 and reappears at frame 950 at a distant, inconsistent location, resulting in an implausible jump in its trajectory.
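
The sigma-point step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes a 6-dimensional ellipsoid state (3D center plus three semi-axes), the classic Julier-style weighting with a single `kappa` parameter, and hypothetical helper names `sigma_points` and `unscented_transform`. The paper's actual state layout and UKF parameters are not given in this summary.

```python
import numpy as np


def sigma_points(mean, cov, kappa=0.0):
    """Return the 2d + 1 sigma points and weights for a d-dimensional Gaussian.

    Hypothetical helper using the Julier-style (kappa) parameterisation.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    # Matrix square root of the scaled covariance via Cholesky factorisation.
    root = np.linalg.cholesky((d + kappa) * cov)
    points = np.empty((2 * d + 1, d))
    points[0] = mean
    for i in range(d):
        points[1 + i] = mean + root[:, i]
        points[1 + d + i] = mean - root[:, i]
    weights = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    weights[0] = kappa / (d + kappa)
    return points, weights


def unscented_transform(points, weights, fn):
    """Push each sigma point through fn and recover the output mean and covariance."""
    outputs = np.array([fn(p) for p in points])
    out_mean = weights @ outputs
    diff = outputs - out_mean
    out_cov = (weights[:, None] * diff).T @ diff
    return out_mean, out_cov


# Assumed state layout: [x, y, z, semi_axis_a, semi_axis_b, semi_axis_c] in metres.
state = np.array([1.0, 2.0, 0.9, 0.3, 0.3, 0.9])
state_cov = np.diag([0.04, 0.04, 0.01, 0.01, 0.01, 0.01])

points, weights = sigma_points(state, state_cov, kappa=1.0)
print(points.shape)  # (13, 6): 2 * 6 + 1 candidate ellipsoids
```

In a full filter, `fn` would be the per-camera measurement function that maps one candidate ellipsoid to a predicted 2D bounding box. Passing the identity function instead is a quick sanity check, since it should return the original mean and covariance.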