Crossmodal learning for Crop Canopy Trait Estimation

Timilehin T. Ayanlade, Anirudha Powadi, Talukder Z. Jubery, Baskar Ganapathysubramanian, Soumik Sarkar

TL;DR

The paper tackles the bottleneck of high-resolution phenotyping by bridging satellite and UAV sensing through cross-modal learning. It introduces a multimodal masked autoencoder that learns to generate UAV-like representations from satellite imagery, using an asymmetric masking strategy to bias reconstruction toward the UAV modality. Downstream tasks show that predicted UAV features closely match real UAV performance for yield and nitrogen prediction and can even improve satellite-only results when used as supplementary input. This approach enables scalable, UAV-level crop phenotyping across large or resource-limited field trials, with potential extensions to additional modalities and time-series data.

Abstract

Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAVs) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Multimodal framework exploits masking to learn the cross-modal predictive coding between satellite and UAV data
  • Figure 2: Samples from training set. We sample two masks, with a total of 66 visible patches out of 392 patches using a biased Dirichlet concentration parameter $\alpha_{\text{sat}} > \alpha_{\text{uav}}$.
  • Figure 3: Field-level visualization of concatenated plot-level images across three time points for a representative location, comparing real satellite imagery (left column), real UAV imagery (middle column), and predicted UAV imagery (right column). Notable sensor artifacts, such as color tints caused by inconsistent camera calibration in the real UAV data, are visibly reduced or absent in the predicted UAV imagery. This highlights the consistency of our pipeline in simulating high-resolution UAV-level representations from satellite input.
  • Figure 4: Comparison of real satellite, real UAV, and predicted UAV imagery at a single time point for Ames, which was affected by cloud occlusion during UAV data collection. Cloud cover during data collection introduced shadow artifacts in portions of the UAV imagery, which appear as darkened or low-contrast patches. In contrast, the predicted UAV imagery effectively mitigates these artifacts, producing a cleaner and more consistent visual representation across plots. This shows that our pipeline can synthesize consistent UAV-level outputs.
  • Figure 5: Examples showing cross-modal conditioning using time-of-day simulations. For each example, the full satellite image and two UAV patches are provided as input. The UAV patches are augmented to simulate different lighting conditions corresponding to morning, afternoon, and evening by adjusting tint, brightness, and contrast. The resulting predictions show coherent propagation of these visual cues across the entire output, indicating the model's sensitivity to subtle modality variations and its ability to generate consistent UAV-level imagery under diverse visual contexts.
  • ...and 2 more figures
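
The asymmetric masking described in the Figure 2 caption (66 visible patches out of 392, with $\alpha_{\text{sat}} > \alpha_{\text{uav}}$) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the concentration values (2.0 and 0.5), the equal split of 392 patches into 196 per modality, and the rounding scheme are all assumptions made for illustration. The idea shown is that a Dirichlet draw divides the visible-patch budget between the two modalities, so a larger satellite concentration leaves more UAV patches masked on average.

```python
import numpy as np

def sample_asymmetric_masks(num_visible=66, patches_per_modality=196,
                            alpha_sat=2.0, alpha_uav=0.5, rng=None):
    """Return boolean 'visible' masks for the satellite and UAV patch grids.

    All default values other than num_visible=66 are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Draw the fraction of the visible budget assigned to each modality.
    # E[share_sat] = alpha_sat / (alpha_sat + alpha_uav), so alpha_sat > alpha_uav
    # keeps more satellite patches visible on average.
    share_sat, _ = rng.dirichlet([alpha_sat, alpha_uav])

    # Convert the share to a patch count, clamped so neither modality
    # is asked for more visible patches than it has.
    lower = max(0, num_visible - patches_per_modality)
    upper = min(num_visible, patches_per_modality)
    n_sat = int(np.clip(round(share_sat * num_visible), lower, upper))
    n_uav = num_visible - n_sat

    masks = {}
    for name, n in (("sat", n_sat), ("uav", n_uav)):
        visible = np.zeros(patches_per_modality, dtype=bool)
        # Choose which patch positions stay visible, uniformly without replacement.
        visible[rng.choice(patches_per_modality, size=n, replace=False)] = True
        masks[name] = visible
    return masks

if __name__ == "__main__":
    masks = sample_asymmetric_masks(rng=np.random.default_rng(0))
    print("visible satellite patches:", int(masks["sat"].sum()))
    print("visible UAV patches:", int(masks["uav"].sum()))
```

Under these assumed concentrations, roughly 80% of the visible patches come from the satellite image in expectation, so most of the reconstruction target lies in the UAV modality.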