PlantTrack: Task-Driven Plant Keypoint Tracking with Zero-Shot Sim2Real Transfer

Samhita Marri, Arun N. Sivakumar, Naveen K. Uppalapati, Girish Chowdhary

TL;DR

PlantTrack addresses robust tracking of plant features in cluttered, deformable environments for agricultural robotics. It uses DINOv2 to extract high-dimensional features, applies a depth-based foreground filter, and trains a multi-stage heatmap predictor to localize leaves and fruits; the heatmap peaks seed an online TAPIR tracker, with both DINOv2 and TAPIR weights frozen. With as few as 20 synthetic images for training, the method achieves zero-shot Sim2Real transfer to real plants and demonstrates online tracking of leaves and fruits. This framework showcases how foundation models can be combined with synthetic data to enable scalable, task-specific keypoint tracking for phenotyping, pruning, and harvesting tasks.
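The depth-based foreground filter mentioned above can be illustrated with a minimal sketch. It assumes the features lie on a grid of 14x14-pixel patches (the DINOv2 patch size), that a patch counts as foreground when its mean valid depth is below a cutoff, and that background features are zeroed. The names `patch_foreground_mask`, `apply_mask`, and `max_depth` are hypothetical, and the source does not specify how the mask is computed or applied. Plain Python lists are used only to keep the example self-contained.

```python
from typing import List

PATCH = 14  # DINOv2 patch size: one feature vector per 14x14 pixel block


def patch_foreground_mask(depth: List[List[float]], max_depth: float,
                          patch: int = PATCH) -> List[List[bool]]:
    """Mark a patch as foreground if its mean valid depth is <= max_depth.

    depth is an H x W map; a value of 0 is treated as an invalid reading
    (assumption). max_depth is a hypothetical cutoff, not a value from the paper.
    """
    rows, cols = len(depth) // patch, len(depth[0]) // patch
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            vals = [depth[y][x]
                    for y in range(r * patch, (r + 1) * patch)
                    for x in range(c * patch, (c + 1) * patch)
                    if depth[y][x] > 0]
            # Patches with no valid depth reading are treated as background.
            row.append(bool(vals) and sum(vals) / len(vals) <= max_depth)
        mask.append(row)
    return mask


def apply_mask(features: List[List[List[float]]],
               mask: List[List[bool]]) -> List[List[List[float]]]:
    """Zero the feature vectors of background patches (assumed behaviour)."""
    return [[vec if keep else [0.0] * len(vec)
             for vec, keep in zip(frow, mrow)]
            for frow, mrow in zip(features, mask)]
```

For a 28x28 depth map whose left half is near the camera and whose right half is far away, the mask keeps the two left patches and the corresponding feature vectors pass through unchanged.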

Abstract

Tracking plant features is crucial for agricultural tasks such as phenotyping, pruning, and harvesting, but the unstructured, cluttered, and deformable nature of plant environments makes it challenging. Recent advances in foundation models show promise in addressing this challenge. We propose PlantTrack, which uses DINOv2 to extract high-dimensional features and trains a keypoint heatmap predictor network to locate semantic features such as fruits and leaves. These locations are then used as prompts for point tracking across video frames with TAPIR. We show that training the keypoint predictor on as few as 20 synthetic images yields zero-shot Sim2Real transfer, enabling effective tracking of plant features in real environments.

Paper Structure (7 sections, 2 equations, 5 figures)

Figures (5)

  • Figure 1: Predicted keypoint tracking using TAPIR at various frames where (top) leaves and (bottom) fruits predicted in the first frame are given as prompt.
  • Figure 2: PlantTrack: First, we extract high-dimensional features from the RGB image using DINOv2. Then, we apply the depth mask to keep only the foreground features. These foreground features are passed through the keypoint heatmap prediction network, which is a convolution-based architecture. The input and output channels for each convolution block $C$ are indicated by the values $f, h, 2$. Once the heatmap prediction network is trained, peak pixel locations across the image are predicted and used as prompts for tracking across video frames with TAPIR.
  • Figure 3: Sample data collected in Blender: (left) the current camera view, (center) the heatmaps, and (right) the binary masks, obtained using the transformed global coordinates of the centers of fruits and leaves.
  • Figure 4: Depth Mask Filter: Using 2-stage PCA to isolate the foreground features is unreliable, as shown in the bottom row, where the fruit features are missing. It requires fine-tuning a threshold parameter, which is tedious to scale. Therefore, we filter out the background using a depth mask, and as seen in the last column, all the features remain intact. Note that we use all the DINOv2 features in the foreground after filtering with the depth mask; PCA is shown here only for visualization. The low resolution in columns 2, 4, and 5 is due to the downscaling of the input image by a factor of 14 in DINOv2.
  • Figure 5: Keypoint Heatmap Prediction Inference: Testing on (top row) an unseen synthetic plant, and zero-shot detection on (middle row) a real plant and (bottom row) a real plant with a cluttered background. The leaf heatmap predictions are shown in column $2$ and leaf and fruit heatmap predictions are shown in column $4$. After normalizing and collecting peaks $>0.6$, the leaf (indicated in red) and fruit (indicated in blue) keypoints are shown in columns $4$ and $5$ respectively.
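
The peak-collection step described in the Figure 5 caption (normalize the heatmap, then keep peaks above 0.6) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: min-max normalization, the 8-neighbour local-maximum test, and the helper names `normalize`, `find_peaks`, and `to_queries` are all assumptions. The `(frame, y, x)` query layout follows the convention commonly used by TAPIR-style point trackers, and should be checked against the tracker's own documentation.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def normalize(heatmap: Heatmap) -> Heatmap:
    """Min-max scale a heatmap to [0, 1]; a flat map becomes all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * len(row) for row in heatmap]
    return [[(v - lo) / (hi - lo) for v in row] for row in heatmap]


def find_peaks(heatmap: Heatmap, threshold: float = 0.6) -> List[Tuple[int, int]]:
    """Return (y, x) cells above threshold that are 8-neighbour local maxima.

    Ties are kept, so a flat plateau above the threshold yields several peaks.
    """
    h = normalize(heatmap)
    rows, cols = len(h), len(h[0])
    peaks = []
    for y in range(rows):
        for x in range(cols):
            v = h[y][x]
            if v <= threshold:
                continue
            neighbours = [h[ny][nx]
                          for ny in range(max(0, y - 1), min(rows, y + 2))
                          for nx in range(max(0, x - 1), min(cols, x + 2))
                          if (ny, nx) != (y, x)]
            if all(v >= n for n in neighbours):
                peaks.append((y, x))
    return peaks


def to_queries(peaks: List[Tuple[int, int]], frame: int = 0) -> List[Tuple[int, int, int]]:
    """Pack peaks as (frame, y, x) query rows (layout is an assumption)."""
    return [(frame, y, x) for y, x in peaks]
```

Peaks found on the feature grid would still need to be scaled back to pixel coordinates (for example by the factor of 14 noted in Figure 4) before being passed to a tracker that operates on the full-resolution video.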