Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention

Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura

TL;DR

ZeroPlantSeg delivers a zero-shot solution for hierarchical segmentation of rosette-shaped plants by integrating SAM-based leaf extraction with vision-language cross-attention to identify leaf bases, followed by greedy clustering to form plant instances without labeled data. The method demonstrates strong cross-domain performance across three agricultural datasets, often outperforming other zero-shot baselines and approaching supervised baselines under domain shifts. This work provides a practical, training-free path for plant phenotyping tasks requiring both leaf- and plant-level segmentation, with clear avenues for extending to non-rosette crops. Overall, it highlights the value of combining foundation segmentation with language-augmented reasoning for structured plant analysis in real-world agricultural imagery.

Abstract

Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. This hierarchical segmentation task typically requires annotated training datasets, which are often species-specific and labor-intensive to create. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals in top-view images. We integrate a foundation segmentation model, which extracts leaf instances, with a vision-language model, which reasons about plant structure, to extract plant individuals without additional training. Evaluations on datasets covering multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.

Paper Structure

This paper contains 37 sections, 7 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Zero-shot hierarchical plant segmentation by our ZeroPlantSeg in comparison with other zero-shot methods. Top: Leaf instance segmentation. Bottom: Plant individual segmentation. The IoU indicates the accuracy of plant individual segmentation. Our method achieves compelling results for both leaf instance and plant individual segmentation, validated by the high IoU score.
  • Figure 2: Overview of ZeroPlantSeg. An input image is sliced with a sliding window, and each patch is fed to the foundation segmentation model (i.e., SAM) to obtain all leaf masks. For leaf instance segmentation, these masks are merged with non-maximum suppression (NMS) to discard duplicates. For plant instance segmentation, the masks are input to the pre-trained cross-attention module to compute their keypoints. These keypoints are then used by our unsupervised greedy clustering to estimate plant individual instances.
  • Figure 3: The procedure to obtain leaf candidate masks and leaf images. (a) Input RGB image. We crop the image with a sliding window (shown as a red square in the figure) and enlarge each crop before feeding it to a segmentation model for finer segmentation. (b) Binary leaf mask output by SAM and selected by OVSeg. (c) An RGB leaf image is obtained by multiplying the input image by the binary mask. (d) The image is cropped and resized to a square image of size ($D, D$).
  • Figure 4: Visualizations of the feature maps and WLS lines (shown as blue lines in the figures) used to obtain leaf keypoints.
  • Figure 5: Segmentation results for leaves and plants on the PhenoBench dataset. The ablation that uses GT leaf instances is denoted as "Ours (GT leaf)".
  • ...and 6 more figures
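
The sliding-window slicing and NMS-based duplicate removal described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sliding_windows`, `mask_iou`, `mask_nms`), the window size, the stride, the use of mask IoU as the overlap measure, and the 0.5 threshold are all hypothetical choices, and the masks are assumed to be already mapped back to full-image coordinates with one confidence score each.

```python
import numpy as np


def sliding_windows(height, width, window, stride):
    """Yield (top, left, bottom, right) crops that together cover the image.

    Window size and stride are illustrative; the caption does not state them.
    """
    tops = list(range(0, max(height - window, 0) + 1, stride))
    lefts = list(range(0, max(width - window, 0) + 1, stride))
    # Add a final window flush with the border so no pixels are left uncovered.
    if tops[-1] + window < height:
        tops.append(height - window)
    if lefts[-1] + window < width:
        lefts.append(width - window)
    for top in tops:
        for left in lefts:
            yield top, left, min(top + window, height), min(left + window, width)


def mask_iou(a, b):
    """Intersection over union of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0


def mask_nms(masks, scores, iou_threshold=0.5):
    """Greedy NMS on full-image masks (assumed criterion: mask IoU).

    Masks are visited from highest to lowest score; a mask is kept only if
    its IoU with every already-kept mask is below the threshold.
    Returns the indices of the kept masks.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    for idx in order:
        if all(mask_iou(masks[idx], masks[k]) < iou_threshold for k in keep):
            keep.append(int(idx))
    return keep
```

In this sketch, overlapping windows would typically produce the same leaf more than once, and `mask_nms` keeps only the highest-scoring copy of each. The segmentation model call itself is omitted, since its interface is not described in this section.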