Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation

Nazia Hossain; Xintong Jiang; Yu Tian; Philippe Seguin; O. Grant Clark; Shangpeng Sun

TL;DR

Vision-Language Weed Segmentation (VL-WS) is a framework that grounds pixel-level crop-weed segmentation in semantically aligned, domain-invariant representations, addressing the poor generalization of existing models across heterogeneous agricultural environments. The results highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.

Abstract

Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. However, existing deep learning models struggle to generalize across heterogeneous agricultural environments due to reliance on dataset-specific visual features. We propose Vision-Language Weed Segmentation (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations. Our architecture employs a dual-encoder design, where frozen Contrastive Language-Image Pretraining (CLIP) embeddings and task-specific spatial features are fused and modulated via Feature-wise Linear Modulation (FiLM) layers conditioned on natural language captions. This design enables image-level textual descriptions to guide channel-wise feature refinement while preserving fine-grained spatial localization. Unlike prior work restricted to training and evaluation on single-source datasets, VL-WS is trained on a unified corpus that includes close-range ground imagery (robotic platforms) and high-altitude UAV imagery, covering diverse crop types, weed species, growth stages, and sensing conditions. Experimental results across four benchmark datasets demonstrate the effectiveness of our framework, with VL-WS achieving a mean Dice score of 91.64% and outperforming the CNN baseline by 4.98%. The largest gains occur on the most challenging weed class, where VL-WS attains an 80.45% Dice score compared to 65.03% for the best baseline, an improvement of 15.42 percentage points. VL-WS further maintains stable weed segmentation performance under limited target-domain supervision, indicating improved generalization and data efficiency. These findings highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.
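The fusion and modulation step described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names (`linear`, `fuse`, `film`), the tiny tensor sizes, and the hand-picked weights are all hypothetical, and in a trained model the projections that produce the FiLM scale and shift would be learned. The sketch only shows the general idea of concatenating a global image embedding to dense features and then applying a per-channel affine transform whose parameters come from a text embedding.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse(spatial: FeatureMap, global_embedding: Vector) -> FeatureMap:
    """Concatenate a global embedding to spatial features.

    Each embedding value is broadcast into one constant channel so that it
    can be stacked with the dense feature map along the channel axis.
    """
    height, width = len(spatial[0]), len(spatial[0][0])
    tiled = [[[value] * width for _ in range(height)] for value in global_embedding]
    return spatial + tiled


def film(features: FeatureMap, gamma: Vector, beta: Vector) -> FeatureMap:
    """Feature-wise linear modulation: gamma[c] * x + beta[c] for channel c."""
    return [
        [[g * px + b for px in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy inputs (hypothetical values, chosen only to keep the arithmetic readable).
spatial = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]  # 2 channels, 2x2
image_embedding = [0.5]        # stand-in for a global image embedding
text_embedding = [1.0, -1.0]   # stand-in for a caption embedding

fused = fuse(spatial, image_embedding)  # 3 channels after concatenation

# Hypothetical projections from the text embedding to per-channel scale/shift.
gamma = linear(text_embedding, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 1.0, 1.0])
beta = linear(text_embedding, [[0.0, 0.0], [1.0, 0.0], [0.0, -1.0]], [0.0, 0.0, 0.0])

modulated = film(fused, gamma, beta)
print(modulated[0])  # channel 0 scaled by gamma[0] = 2.0
```

Because the scale and shift are shared by every pixel in a channel, the text can re-weight channels without altering the spatial layout of the features, which is consistent with the abstract's point about preserving spatial localization.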

Paper Structure (43 sections, 9 equations, 11 figures, 4 tables)

Introduction
Related work
Challenges of Multi-Dataset Training with Shared Labels
CLIP for Semantic Robustness in Multi-Dataset Learning
Materials and Methods
Datasets and Annotations
UAV Soybean Dataset
PhenoBench Dataset
GrowingSoy Dataset
ROSE Dataset
Image Captions for Vision-Language Segmentation
Proposed Network Architecture
Overview
Visual Encoder Backbone
Language and Image Embeddings via CLIP
...and 28 more sections

Figures (11)

Figure 1: Image-text cosine similarity scores produced by a frozen CLIP model for soybean field images with different levels of weed presence. For crop-dominant scenes, image embeddings show higher similarity to the soybean prompt, while weed-dense scenes exhibit higher similarity to the weed prompt. This shift in similarity demonstrates that CLIP implicitly captures agronomic scene semantics without task-specific fine-tuning.
Figure 2: Overview of the proposed Vision-Language Weed Segmentation (VL-WS) framework. Dense spatial features extracted by a task-specific visual backbone are concatenated with global image embeddings from a pretrained CLIP encoder. The fused features are modulated by natural-language captions through Feature-wise Linear Modulation (FiLM), enabling text-conditioned channel adaptation for pixel-level crop-weed segmentation.
Figure 3: Study area and UAV orthomosaic of the experimental field. The figure illustrates the geographic context of the study site. The top-left panel shows the location of Sainte-Anne-de-Bellevue within the Province of Quebec, Canada, where the study area is located. The top-right panel presents an aerial view of the surrounding agricultural landscape, highlighting the specific experimental plot (outlined in red; 88 m $\times$ 18 m). The bottom panel displays the high-resolution UAV orthomosaic acquired using a DJI Mavic 3 Multispectral (M3M) at 5 m altitude, with a spatial scale bar provided for reference.
Figure 4: Representative RGB image tiles and corresponding ground-truth segmentation masks from the four weed segmentation datasets used in this study: UAV Soybean, PhenoBench, GrowingSoy, and ROSE. For each dataset, paired image-mask examples are shown, with RGB images on the left and manually annotated masks on the right. Mask colors denote background (black), crop (green), and weed (red). The datasets span diverse crop types, weed densities, and imaging conditions, including aerial and ground-based imagery.
Figure 5: Example RGB image tiles paired with agronomy-aware natural language captions generated for each dataset (UAV Soybean, PhenoBench, GrowingSoy, and ROSE). Captions are produced using a standardized template and describe crop and weed presence, coarse spatial layout, and salient visual attributes. These image-caption pairs serve as multimodal inputs for training and evaluating the proposed vision-language segmentation framework.
...and 6 more figures
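
The image-text comparison described in the Figure 1 caption rests on cosine similarity between an image embedding and each prompt embedding. The following is a minimal sketch of that comparison only: the two-dimensional vectors are made-up stand-ins for CLIP embeddings, and the helper names `cosine_similarity` and `best_prompt` are hypothetical. No actual CLIP model is loaded here.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_prompt(image_embedding: List[float], prompts: Dict[str, List[float]]) -> str:
    """Return the name of the prompt whose embedding is most similar to the image."""
    return max(prompts, key=lambda name: cosine_similarity(image_embedding, prompts[name]))


# Made-up embeddings: axis 0 loosely stands for "soybean", axis 1 for "weed".
prompt_embeddings = {"soybean": [1.0, 0.0], "weed": [0.0, 1.0]}
crop_dominant_scene = [3.0, 1.0]
weed_dense_scene = [1.0, 4.0]

for name, scene in [("crop-dominant", crop_dominant_scene), ("weed-dense", weed_dense_scene)]:
    scores = {p: round(cosine_similarity(scene, e), 3) for p, e in prompt_embeddings.items()}
    print(name, scores, "->", best_prompt(scene, prompt_embeddings))
```

With real embeddings the vectors would come from the frozen CLIP image and text encoders; the ranking logic would stay the same.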
