Transformer-Based Spatio-Temporal Association of Apple Fruitlets
Harry Freeman, George Kantor
TL;DR
This work tackles the spatio-temporal association of small apple fruitlets across days using stereo imagery. It introduces a transformer-based pipeline that encodes per-fruitlet shape $d_i$ and position $p_i$ from stereo point clouds and refines these features through self- and cross-attention to predict cross-day correspondences. On orchard data, the method achieves an F1-score of $92.4\%$, outperforming the ICP-Assoc, Desc-Assoc, and LoFTR-based baselines, with ablations highlighting the importance of the positional descriptor and shape pre-training. Beyond apples, the approach generalizes to the Pheno4D datasets (tomato and maize) with high precision and recall, demonstrating potential for scalable growth monitoring and yield estimation under field conditions.
Abstract
In this paper, we present a transformer-based method to spatio-temporally associate apple fruitlets in stereo images collected on different days and from different camera poses. State-of-the-art association methods in agriculture focus on matching larger crops using either high-resolution point clouds or temporally stable features, both of which are difficult to obtain for smaller fruit in the field. To address these challenges, we propose a transformer-based architecture that encodes the shape and position of each fruitlet, then propagates and refines these features through a series of transformer encoder layers with alternating self- and cross-attention. We demonstrate that our method achieves an F1-score of 92.4% on data collected in a commercial apple orchard and outperforms all baselines and ablations.
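The alternating self- and cross-attention refinement described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention with a residual connection, omits layer normalization and feed-forward sublayers, uses caller-supplied projection matrices in place of learned weights, and reads out matches with a simple mutual-nearest-neighbour rule. The names `attend`, `refine`, and `mutual_matches` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a residual connection.

    queries: (n, d) features being updated.
    context: (m, d) features attended to (same set for self-attention,
             the other day's set for cross-attention).
    """
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return queries + weights @ v


def refine(feats_a, feats_b, layers):
    """Alternate self-attention (even layers) and cross-attention (odd layers).

    layers: list of (w_q, w_k, w_v) tuples, one per layer (assumed shapes (d, d)).
    """
    for i, (w_q, w_k, w_v) in enumerate(layers):
        if i % 2 == 0:
            # Self-attention: each fruitlet attends to fruitlets of the same day.
            new_a = attend(feats_a, feats_a, w_q, w_k, w_v)
            new_b = attend(feats_b, feats_b, w_q, w_k, w_v)
        else:
            # Cross-attention: each fruitlet attends to the other day's fruitlets.
            new_a = attend(feats_a, feats_b, w_q, w_k, w_v)
            new_b = attend(feats_b, feats_a, w_q, w_k, w_v)
        feats_a, feats_b = new_a, new_b
    return feats_a, feats_b


def mutual_matches(feats_a, feats_b):
    """Return (i, j) pairs that are each other's best match by dot-product score."""
    scores = feats_a @ feats_b.T
    best_b = scores.argmax(axis=1)  # best day-2 fruitlet for each day-1 fruitlet
    best_a = scores.argmax(axis=0)  # best day-1 fruitlet for each day-2 fruitlet
    return [(i, int(j)) for i, j in enumerate(best_b) if best_a[j] == i]
```

In this sketch, the per-fruitlet input features would stand in for the encoded shape and position descriptors. Computing both cross-attention updates from the pre-update features keeps the two days symmetric within a layer.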
