Multi-label Instance-level Generalised Visual Grounding in Agriculture

Mohammadreza Haghighat, Alzayat Saleh, Mostafa Rahimi Azghadi

Abstract

Understanding field imagery, such as detecting plants and distinguishing individual crop and weed instances, is a central challenge in precision agriculture. Despite progress in vision-language tasks like captioning and visual question answering, Visual Grounding (VG), the task of localising objects referred to in language, remains unexplored in agriculture. A key reason is the lack of suitable benchmark datasets for evaluating grounding models in field conditions, where many plants look highly similar, appear at multiple scales, and the referred target may be absent from the image. To address these limitations, we introduce gRef-CW, the first dataset designed for generalised visual grounding in agriculture, including negative expressions. Benchmarking current state-of-the-art grounding models on gRef-CW reveals a substantial domain gap, highlighting their inability to ground instances of crops and weeds. Motivated by these findings, we introduce Weed-VG, a modular framework that incorporates multi-label hierarchical relevance scoring and interpolation-driven regression. Weed-VG advances instance-level visual grounding and provides a clear baseline for developing VG methods in precision agriculture. Code will be released upon acceptance.

Paper Structure (13 sections, 7 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of the Weed-VG Framework for Precision Agriculture. This figure summarises the paper’s main contributions for tackling the challenges of visual grounding (VG) in agricultural imagery. It presents gRef-CW, a generalised Referring-expression dataset for Crop and Weed visual grounding, comprising 8,000+ high-resolution field images and 82,000 annotations at both the image level and instance level. The figure also highlights Hierarchical Relevance Scoring (HRS), a modular method that enables existence-aware instance grounding, first determining whether the referred instance is present and then localising it.
  • Figure 2: gRef-CW Data Collection and Annotation Pipeline. The dataset is constructed through a four-stage process: (1) filtering out instances unrecognisable by humans and categorising images from the selected subset as single, mixed, or no-target; (2) extracting instance attributes (e.g. size, category, and position); (3) composing attributes into natural-language templates to generate positive instance-level referring expressions, while negative image-level sentences state the non-existence of the target; and (4) replacing categories or swapping attributes to create test sentences.
  • Figure 3: Detailed instance distribution of gRef-CW across splits. The figure displays (a) the count of instances stratified by scale (from Tiny to Large) and (b) the number of images categorised by scene density. Annotations within the bars indicate the percentage of Crop (C) and Weed (W) for each subset.
  • Figure 4: Architecture of the Weed-VG framework. It extends a grounding model with an HRS module that fuses visual and textual features via Multi-Head Cross-Attention and an FFN. HRS decomposes relevance into two levels: (Level 0) global existence detection, which predicts whether the referred object appears in the image, and (Level 1) instance relevance, which ranks region proposals by integrating sentence-level and word-level similarities. A constraint enforces logical consistency by conditioning instance localisation on global existence.
  • Figure 5: Qualitative results of Weed-VG on positive and negative referring expressions in gRef-CW across challenging scales (tiny–medium).
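
The two-level decomposition described in the Figure 4 caption (global existence first, then instance relevance conditioned on it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gate_instance_scores`, the 0.5 threshold, and the use of a simple product to condition instance scores on the existence probability are all assumptions made for illustration, and the scores are taken as given rather than computed by cross-attention.

```python
from typing import List, Optional, Tuple


def gate_instance_scores(
    existence_prob: float,
    instance_scores: List[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Optional[Tuple[int, float]]:
    """Pick the most relevant region proposal, conditioned on global existence.

    existence_prob:  Level 0 output, probability that the referred object
                     appears anywhere in the image.
    instance_scores: Level 1 output, one relevance score per region proposal.

    Returns (proposal index, gated score), or None when the target is judged
    absent (a negative expression) or there are no proposals.
    """
    # Consistency constraint: no instance is localised unless the
    # image-level check says the target exists.
    if existence_prob < threshold or not instance_scores:
        return None

    # Assumed gating: scale each instance score by the existence probability.
    gated = [existence_prob * score for score in instance_scores]
    best = max(range(len(gated)), key=gated.__getitem__)
    return best, gated[best]


if __name__ == "__main__":
    # Positive expression: the second proposal is ranked highest.
    print(gate_instance_scores(0.9, [0.2, 0.8, 0.5]))
    # Negative expression: low existence probability suppresses all proposals.
    print(gate_instance_scores(0.1, [0.9, 0.7]))
```

Under these assumptions, a low Level 0 probability returns no box at all even when an individual proposal scores highly, which is the existence-aware behaviour that the captions for Figures 1 and 4 describe.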