MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots
David Rapado-Rincon, Henk Nap, Katarina Smolenova, Eldert J. van Henten, Gert Kootstra
TL;DR
This work introduces MOT-DETR, a transformer‑based method for 3D multi‑object tracking in robotic multi‑view settings, combining 2D color images and 3D point clouds to build coherent 3D representations of agro‑food environments. It adapts DETR with an additional re‑ID head and uses Hungarian matching for cross‑view data association, leveraging both self‑ and cross‑attention to fuse modalities. A key contribution is the generation of large synthetic tomato plant datasets via an L‑system, enabling effective training alongside real data and improving robustness to occlusions and viewpoint shifts. Empirical results on real greenhouse sequences and synthetic viewpoints show MOT-DETR outperforms state‑of‑the‑art MOT methods, demonstrates resilience to camera pose noise, and highlights the value of 3D information for long‑range occlusions in agricultural robotics.
Abstract
With the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is challenging due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we demonstrate that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show that our method is resilient to camera pose noise that can affect the accuracy of point clouds. The implementation of MOT-DETR is available at: https://github.com/drapado/mot-detr
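The cross-view association step mentioned in the TL;DR (re-ID features matched with the Hungarian algorithm) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (cosine_distance, hungarian, associate), the use of cosine distance as the matching cost, and the max_cost gating threshold are all assumptions made for illustration. The sketch uses only the Python standard library and assumes each tracked object and each new detection comes with a re-ID embedding vector.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def hungarian(cost):
    """Minimum-cost assignment for a matrix with rows <= columns.

    Returns (row, col) pairs. Potentials-based O(n^2 * m) variant of the
    Hungarian (Kuhn-Munkres) algorithm, using 1-based internal indices.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j] = row currently matched to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path found above.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]


def associate(track_embs, det_embs, max_cost=0.5):
    """Match existing tracks to new detections by re-ID embedding.

    max_cost is a hypothetical gating threshold: pairs whose cost exceeds
    it are dropped, so the detection could start a new track instead.
    Returns sorted (track_index, detection_index) pairs.
    """
    if not track_embs or not det_embs:
        return []
    cost = [[cosine_distance(t, d) for d in det_embs] for t in track_embs]
    if len(cost) <= len(cost[0]):
        pairs = hungarian(cost)
    else:
        # More tracks than detections: solve the transposed problem.
        transposed = [list(col) for col in zip(*cost)]
        pairs = [(r, c) for c, r in hungarian(transposed)]
    return sorted((r, c) for r, c in pairs if cost[r][c] <= max_cost)


if __name__ == "__main__":
    tracks = [[1.0, 0.0], [0.0, 1.0]]
    detections = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]
    print(associate(tracks, detections))  # [(0, 1), (1, 0)]
```

In this toy example the third detection stays unmatched, which is the situation where a tracker would typically open a new track. A real system might also fold 3D position into the cost, but that choice is outside what this summary specifies.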
