MOT-DETR: 3D Single Shot Detection and Tracking with Transformers to build 3D representations for Agro-Food Robots

David Rapado-Rincon; Henk Nap; Katarina Smolenova; Eldert J. van Henten; Gert Kootstra

TL;DR

This work introduces MOT-DETR, a transformer‑based method for 3D multi‑object tracking in robotic multi‑view settings, combining 2D color images and 3D point clouds to build coherent 3D representations of agro‑food environments. It adapts DETR with an additional re‑ID head and uses Hungarian matching for cross‑view data association, leveraging both self‑ and cross‑attention to fuse modalities. A key contribution is the generation of large synthetic tomato plant datasets via an L‑system, enabling effective training alongside real data and improving robustness to occlusions and viewpoint shifts. Empirical results on real greenhouse sequences and synthetic viewpoints show MOT-DETR outperforms state‑of‑the‑art MOT methods, demonstrates resilience to camera pose noise, and highlights the value of 3D information for long‑range occlusions in agricultural robotics.
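The cross-view data association step mentioned above can be illustrated with a minimal sketch. It assumes that each tracked object and each new detection carries a re-ID feature vector, that the matching cost is cosine distance, and that matches above a cost threshold are rejected. The function name `associate`, the cosine cost, and the `max_cost` value are illustrative assumptions, not details taken from the paper; only the use of Hungarian matching comes from the summary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(track_feats, det_feats, max_cost=0.5):
    """Match detections to tracks with the Hungarian algorithm.

    track_feats: (T, D) re-ID features of existing tracks.
    det_feats:   (N, D) re-ID features of detections in the new viewpoint.
    max_cost:    hypothetical gating threshold (not a value from the paper).

    Returns (matches, unmatched_dets), where matches is a list of
    (track_index, detection_index) pairs.
    """
    track_feats = np.asarray(track_feats, dtype=float)
    det_feats = np.asarray(det_feats, dtype=float)
    if track_feats.size == 0 or det_feats.size == 0:
        # Nothing to match: every detection would start a new track.
        return [], list(range(len(det_feats)))

    # L2-normalise so that the dot product equals cosine similarity.
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T  # cosine distance, shape (T, N)

    # Hungarian matching: minimum total cost one-to-one assignment.
    rows, cols = linear_sum_assignment(cost)

    # Discard assignments whose cost exceeds the (assumed) gate.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched = {c for _, c in matches}
    unmatched_dets = [j for j in range(len(det_feats)) if j not in matched]
    return matches, unmatched_dets


tracks = [[1.0, 0.0], [0.0, 1.0]]
dets = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches, unmatched = associate(tracks, dets)
print(matches, unmatched)
```

In this toy example the third detection resembles neither track, so it is left unmatched and would receive a new tracking ID in a full tracker.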

Abstract

Given the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Most multi-object tracking (MOT) algorithms are designed for high frame rate sequences and struggle with the occlusions generated by robots' motions and 3D environments. In this paper, we introduce MOT-DETR, a novel approach to detect and track objects in 3D over time using a combination of convolutional networks and transformers. Our method processes 2D and 3D data, and employs a transformer architecture to perform data fusion. We show that MOT-DETR outperforms state-of-the-art multi-object tracking methods. Furthermore, we demonstrate that MOT-DETR can leverage 3D data to deal with long-term occlusions and large frame-to-frame distances better than state-of-the-art methods. Finally, we show that our method is resilient to camera pose noise, which can affect the accuracy of point clouds. The implementation of MOT-DETR can be found here: https://github.com/drapado/mot-detr

Paper Structure (14 sections, 1 equation, 6 figures, 6 tables)

Introduction
Related work and contributions
Materials and Methods
Data pre-processing
MOT-DETR architecture
Training
Inference and tracking
Data
Results and Discussion
Detection performance and Inference speed
Tracking performance
Ablation studies
Effect of camera pose noise
Conclusion

Figures (6)

Figure 1: Example of the tracking and 3D representation results over five frames of a sequence. The top row shows the viewpoint color image and the bounding boxes of the tracked tomatoes, with their corresponding tracking ID. The bottom row contains the merged point cloud from the first frame. Additionally, overlapping the point cloud there are spheres that represent each tracked tomato in 3D.
Figure 2: MOT-DETR architecture: color images and point clouds are processed using independent CNNs, then the resulting feature maps are concatenated and passed to a transformer. The transformer predicts a set of bounding boxes with corresponding class and re-ID features. The re-ID features are passed to a data association algorithm that generates the ID of every detection.
Figure 3: Left. Example of a synthetic plant model. Right. Example of a generated viewpoint. Extra plants are added in the background to increase the complexity of the scene.
Figure 4: Robotic system used for data collection. The robot arm ABB IRB1200 is mounted over a mobile platform that allows motion over the greenhouse row rails. A Realsense L515 camera is mounted on the end-effector.
Figure 5: Left. Illustration of the path followed by the robot to collect viewpoints of real plants. Right. Example of a viewpoint in a real plant.
...and 1 more figure
