
SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision

Shuai Xiang, Wei Guo, James Burridge, Shouyang Liu, Hao Lu, Tokihiro Fukatsu

Abstract

Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce $SPROUT$ ($S$calable $P$lant $R$epresentation model via $O$pen-field $U$nsupervised $T$raining), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising while enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
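As described, SPROUT is pre-trained with plain diffusion denoising directly in pixel space, with the network parameterized to predict the added noise $\epsilon$. The following is a minimal sketch of one such training step in PyTorch; the model call signature, the noise schedule, and all shapes are illustrative assumptions, not SPROUT's released configuration.

```python
import torch
import torch.nn.functional as F

# Hypothetical pixel-space diffusion transformer; any vision transformer that
# maps a noisy image plus a timestep to a noise prediction would fit here.
# model = PixelDiT(...)  # placeholder name, not the released SPROUT architecture

def ddpm_epsilon_loss(model, x0, alphas_cumprod):
    """One denoising step: corrupt clean images x0 and regress the added noise.

    x0:             (B, 3, H, W) clean agricultural images scaled to [-1, 1]
    alphas_cumprod: (T,) cumulative product of the noise-schedule alphas
    """
    B = x0.shape[0]
    T = alphas_cumprod.shape[0]

    # Sample a random diffusion timestep per image.
    t = torch.randint(0, T, (B,), device=x0.device)
    a_bar = alphas_cumprod.to(x0.device)[t].view(B, 1, 1, 1)

    # Forward (noising) process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # The network is parameterized to predict eps (epsilon-prediction).
    eps_pred = model(x_t, t)
    return F.mse_loss(eps_pred, eps)
```

Because the transformer consumes raw (noised) pixels rather than VAE latents, the whole pipeline can be trained end-to-end, which is the efficiency point made in the abstract.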

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Left: PCA visualization of SPROUT feature maps. In the SPROUT embedding space, different plant organs are clearly separated semantically, while instances of the same organ share consistent semantics, indicating that SPROUT captures the structural information of plants. Right: SPROUT's performance on agricultural vision tasks. Absolute Relative Error is used for depth estimation, Mean Squared Error for counting, and Intersection over Union (IoU) for all other tasks; metrics where lower is better are inversely normalized. SPROUT significantly outperforms general-purpose VFMs across a wide range of tasks, particularly structural understanding and dense prediction.
  • Figure 2: SPROUT structure and training. The model output is parameterized as $\epsilon$ (noise prediction), and the approach operates directly in pixel space.
  • Figure 3: Comparison of dense features. We use Principal Component Analysis (PCA) to reduce the feature maps to three dimensions and project them into RGB space ($\mathbb{R}^{h \times w \times c} \to \mathbb{R}^{h \times w \times 3}$); a rough sketch of this projection follows the figure list. All feature maps are $128 \times 128$. Compared to prior methods, SPROUT yields cleaner features with less noise and more distinct semantics.
  • Figure 4: Depth estimation performance comparison. Our method produces sharp, accurate depth maps.
  • Figure 5: SPROUT demonstrates strong scaling behavior: downstream fine-tuning performance improves with increasing model size and pre-training compute, following a clear power-law relationship.
  • ...and 2 more figures
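Figures 1 and 3 visualize dense features by projecting each feature map onto its top three principal components and mapping those to RGB. A rough, self-contained sketch of that projection is below; the per-component min-max normalization used for display is an assumed detail, not necessarily the paper's exact recipe.

```python
import numpy as np

def pca_to_rgb(feats: np.ndarray) -> np.ndarray:
    """Project a dense feature map (h, w, c) to an RGB image (h, w, 3) via PCA."""
    h, w, c = feats.shape
    flat = feats.reshape(-1, c).astype(np.float64)

    # Center the features and take the top-3 principal directions via SVD.
    flat -= flat.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T                      # (h*w, 3)

    # Min-max normalize each principal component into [0, 1] for display.
    proj -= proj.min(axis=0, keepdims=True)
    proj /= proj.max(axis=0, keepdims=True) + 1e-8
    return proj.reshape(h, w, 3)
```

The function expects a single feature map of shape (h, w, c), e.g. the $128 \times 128$ maps shown in Figure 3.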