Transformer-Based Spatio-Temporal Association of Apple Fruitlets

Harry Freeman, George Kantor

TL;DR

This work tackles the spatio-temporal association of small apple fruitlets across days using stereo imagery. It introduces a transformer-based pipeline that encodes per-fruitlet shape $d_i$ and position $p_i$ from stereo point clouds and refines these features through self- and cross-attention to predict cross-day correspondences. On orchard data, the method achieves an F1-score of $92.4\%$, outperforming the ICP-Assoc, Desc-Assoc, and LoFTR-based baselines, with ablations highlighting the importance of the positional descriptor and shape pre-training. Beyond apples, the approach generalizes to the Pheno4D dataset (tomato and maize) with high precision and recall, demonstrating potential for scalable growth monitoring and yield estimation under field conditions.

Abstract

In this paper, we present a transformer-based method to spatio-temporally associate apple fruitlets in stereo images collected on different days and from different camera poses. State-of-the-art association methods in agriculture focus on matching larger crops using either high-resolution point clouds or temporally stable features, both of which are difficult to obtain for smaller fruit in the field. To address these challenges, we propose a transformer-based architecture that encodes the shape and position of each fruitlet, and propagates and refines these features through a series of transformer encoder layers with alternating self- and cross-attention. We demonstrate that our method achieves an F1-score of 92.4% on data collected in a commercial apple orchard and outperforms all baselines and ablations.
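
The alternating self- and cross-attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each fruitlet is already summarized by one feature vector (standing in for the encoded shape and position), uses single-head attention without learned projections, feed-forward blocks, or layer normalization, and scores candidate pairs with a plain dot product. The names `attend`, `refine`, and `score_matrix` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Single-head scaled dot-product attention with a residual connection.

    Assumption: no learned projections, feed-forward block, or layer norm.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def refine(day_a, day_b, num_layers=4):
    """Alternate self-attention (within a day) and cross-attention (across days)."""
    for layer in range(num_layers):
        if layer % 2 == 0:
            # Self-attention: each fruitlet attends to fruitlets from the same day.
            day_a, day_b = attend(day_a, day_a), attend(day_b, day_b)
        else:
            # Cross-attention: each fruitlet attends to fruitlets from the other day.
            day_a, day_b = attend(day_a, day_b), attend(day_b, day_a)
    return day_a, day_b


def score_matrix(day_a, day_b):
    """Dot-product similarity between every cross-day pair of fruitlets."""
    return [[sum(x * y for x, y in zip(a, b)) for b in day_b] for a in day_a]


# Toy input: 3 fruitlets on one day, 4 on another, 8-dimensional features.
random.seed(0)
feats_a = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
feats_b = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
ref_a, ref_b = refine(feats_a, feats_b)
scores = score_matrix(ref_a, ref_b)  # 3 x 4 matrix of pairwise scores
```

The two days may contain different numbers of fruitlets, so the score matrix is rectangular; how scores are turned into final matches (and how unmatched fruitlets are handled) is left out of this sketch.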

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Top: Example of fruit growth over a four-day period. Left, middle, and right show day 1, day 3, and day 5, with average diameters of 5.5 mm, 7.9 mm, and 9.7 mm, respectively. Bottom: Images of the same fruitlet cluster taken from different camera poses, which often occurs in the field.
  • Figure 2: Overview of our point cloud extraction and spatio-temporal association pipeline
  • Figure 3: Results and baseline comparison for precision and recall when matching is performed across a specified number of days.
  • Figure 4: Ablation comparison for precision and recall when matching is performed across a specified number of days.
  • Figure 5: Examples of spatio-temporal association results. Left column: correctly associated fruitlets. Middle column: correctly associated fruitlets when a fruitlet is occluded or has fallen off. Right column: incorrect association examples, where red and orange lines indicate false positive and false negative matches, respectively.
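
The precision, recall, and F1 figures reported above follow from counts of correct, false positive, and false negative matches. The sketch below uses the standard definitions and assumes a match is represented as a pair of fruitlet indices (one per day); the paper's exact evaluation protocol is not given in this summary, and the function name `association_metrics` is hypothetical.

```python
def association_metrics(predicted, ground_truth):
    """Precision, recall, and F1 for sets of (index_day_a, index_day_b) matches."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_pos = len(predicted & ground_truth)   # matches that are correct
    false_pos = len(predicted - ground_truth)  # predicted but wrong
    false_neg = len(ground_truth - predicted)  # true matches that were missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 3 predicted matches, 4 true matches, 2 in common.
pred = {(0, 0), (1, 2), (2, 1)}
truth = {(0, 0), (1, 1), (2, 1), (3, 3)}
precision, recall, f1 = association_metrics(pred, truth)
# precision = 2/3, recall = 1/2, F1 = 4/7
```

In this toy case, a fruitlet that has fallen off by the second day has no true match, so any predicted match for it counts only as a false positive.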