A Modular Pipeline for 3D Object Tracking Using RGB Cameras
Lars Bredereke, Yale Hartmann, Tanja Schultz
TL;DR
This work tackles 3D multi-object tracking with multiple time-synced RGB cameras by proposing a modular pipeline that estimates 3D trajectories even when camera poses vary between trials. It integrates a YOLO-based 2D detector, a gradient-based camera-parameter optimization to align six cameras in a common world frame, 3D object initialization via line intersections, and Extended Kalman Filter tracking to fuse detections over time, providing trajectory estimates with covariance as a confidence measure. The approach demonstrates robust performance on the Table Setting Dataset, achieving close agreement with OptiTrack for a cereal-box example and maintaining plausible trajectories across trials with missing cameras, while requiring minimal human annotation. The pipeline is scalable and adaptable to other scenes with stationary, time-synced cameras, and the accompanying data outputs and code enable reuse and broader evaluation.
Abstract
Object tracking is a key challenge in computer vision, with varied applications that call for different architectures. Most tracking systems have limitations: many constrain all movement to a 2D plane, and they often track only one object. In this paper, we present a new modular pipeline that calculates 3D trajectories of multiple objects. It is adaptable to any setting in which multiple time-synced, stationary cameras record moving objects, and it works with off-the-shelf webcams. We tested the pipeline on the Table Setting Dataset, in which participants are recorded with various sensors as they set a table with tableware objects. The task is to track these manipulated objects using six RGB webcams. Challenges include detecting small objects in 9,874,699 camera frames, determining camera poses, discriminating between nearby and overlapping objects, handling temporary occlusions, and finally calculating a 3D trajectory from the right subset of an average of 11.12.456 pixel coordinates per 3-minute trial. We implement a robust pipeline that produces accurate trajectories, with the covariance of the x, y, z position as a confidence metric. It handles appearing and disappearing objects dynamically by instantiating new Extended Kalman Filters. It scales to hundreds of table-setting trials with very little human annotation, even when the camera poses of each trial are unknown. The code is available at https://github.com/LarsBredereke/object_tracking
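To illustrate the idea of initializing a 3D object position from line intersections, the following is a minimal sketch, not the authors' implementation. It assumes each camera contributes a viewing ray (a camera center and a direction through the detected pixel) already expressed in a common world frame, and it computes the point that minimizes the summed squared distance to all rays. The function names `closest_point_to_rays` and `solve_3x3`, and the example camera positions, are hypothetical.

```python
from math import sqrt


def solve_3x3(A, b):
    """Solve A x = b for a 3x3 system by Gaussian elimination with partial pivoting."""
    n = 3
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; no unique closest point")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def closest_point_to_rays(origins, directions):
    """Least-squares point closest to a set of 3D lines.

    Each line i passes through origins[i] along directions[i].
    With unit direction u_i, the point p solves
        sum_i (I - u_i u_i^T) p = sum_i (I - u_i u_i^T) o_i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = sqrt(sum(c * c for c in d))
        u = [c / norm for c in d]
        for r in range(3):
            for c in range(3):
                m = (1.0 if r == c else 0.0) - u[r] * u[c]  # projector entry
                A[r][c] += m
                b[r] += m * o[c]
    return solve_3x3(A, b)


# Hypothetical example: three cameras looking at the same object.
target = (1.0, 2.0, 3.0)
cameras = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 1.0)]
rays = [tuple(t - c for t, c in zip(target, cam)) for cam in cameras]
point = closest_point_to_rays(cameras, rays)
```

With noise-free rays the recovered `point` coincides with `target` up to floating-point error; with noisy detections the rays no longer meet exactly, and the same formula returns a compromise point that could serve as an initial state for a tracking filter.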
