Crossmodal learning for Crop Canopy Trait Estimation
Timilehin T. Ayanlade, Anirudha Powadi, Talukder Z. Jubery, Baskar Ganapathysubramanian, Soumik Sarkar
TL;DR
The paper tackles the bottleneck of high-resolution phenotyping by bridging satellite and UAV sensing through cross-modal learning. It introduces a multimodal masked autoencoder that learns to generate UAV-like representations from satellite imagery, using an asymmetric masking strategy to bias reconstruction toward the UAV modality. Downstream tasks show that predicted UAV features closely match real UAV performance for yield and nitrogen prediction and can even improve satellite-only results when used as supplementary input. This approach enables scalable, UAV-level crop phenotyping across large or resource-limited field trials, with potential extensions to additional modalities and time-series data.
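As a rough illustration of the asymmetric masking idea, the sketch below draws independent random patch masks for the two modalities, hiding far more UAV patches than satellite patches so that reconstruction must lean on the satellite input. The function name `asymmetric_mask`, the token counts, and the masking ratios (0.25 and 0.9) are illustrative assumptions; the summary above does not state the values actually used.

```python
import numpy as np

def asymmetric_mask(n_sat, n_uav, sat_ratio=0.25, uav_ratio=0.9, rng=None):
    """Return boolean masks (True = patch hidden) for satellite and UAV tokens.

    The ratios are assumed values for illustration only; a higher uav_ratio
    biases the reconstruction task toward the UAV modality.
    """
    rng = np.random.default_rng() if rng is None else rng

    def pick(n_tokens, ratio):
        # Hide a fixed fraction of tokens, chosen uniformly without replacement.
        n_masked = int(round(n_tokens * ratio))
        mask = np.zeros(n_tokens, dtype=bool)
        mask[rng.choice(n_tokens, size=n_masked, replace=False)] = True
        return mask

    return pick(n_sat, sat_ratio), pick(n_uav, uav_ratio)

# Example: 196 patch tokens per modality (a hypothetical 14 x 14 patch grid).
sat_mask, uav_mask = asymmetric_mask(196, 196, rng=np.random.default_rng(0))
```

Under the same assumptions, calling the function with `uav_ratio=1.0` hides every UAV token, which mimics the setting where only satellite imagery is available and the UAV side has to be predicted entirely.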
Abstract
Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAVs) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are limited by their spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.
