Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile Baseline

Tianqi Wei; Zhi Chen; Zi Huang; Xin Yu

TL;DR

This work tackles in-the-wild plant disease recognition, where different diseases can look alike (small inter-class discrepancy) and the same disease can look very different across images (large intra-class variance). It introduces PlantWild, a large-scale multimodal dataset that pairs in-the-wild images with descriptive text prompts for 89 classes (56 diseased and 33 healthy), so that textual information can help discriminate between classes. The proposed MVPDR baseline builds visual and textual prototypes from CLIP features and supports fully supervised, few-shot, and zero-shot scenarios by learning prototype weights while keeping the CLIP backbone fixed. In the reported experiments, MVPDR achieves state-of-the-art performance on in-the-wild datasets such as PlantWild and PlantDoc, generalizes robustly, and can localize lesions, which highlights the value of multimodal cues for real-world plant disease recognition.

Abstract

Existing plant disease classification models have achieved remarkable performance in recognizing in-laboratory diseased images. However, their performance often degrades significantly when classifying in-the-wild images. Furthermore, we observed that in-the-wild plant images may exhibit similar appearances across various diseases (i.e., small inter-class discrepancy) while the same disease may look quite different across images (i.e., large intra-class variance). Motivated by this observation, we propose an in-the-wild multimodal plant disease recognition dataset that not only contains the largest number of disease classes but also provides text-based descriptions for each disease. In particular, the text descriptions provide rich information in the textual modality and facilitate in-the-wild disease classification despite small inter-class discrepancy and large intra-class variance. Therefore, our proposed dataset can be regarded as an ideal testbed for evaluating disease recognition methods in the real world. In addition, we present a strong yet versatile baseline that models text descriptions and visual data through multiple prototypes for a given class. By fusing the contributions of multimodal prototypes in classification, our baseline can effectively address the small inter-class discrepancy and large intra-class variance issues. Remarkably, our baseline model can not only classify diseases but also recognize diseases in few-shot or training-free scenarios. Extensive benchmarking results demonstrate that our proposed in-the-wild multimodal dataset poses many new challenges for the plant disease recognition task and leaves substantial room for improvement in future work.
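
The prototype idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: random vectors stand in for CLIP image and text embeddings, visual features are grouped with a simple spherical k-means, and the visual and textual scores are fused with a fixed weight `alpha`. The function names, the max-over-prototypes scoring rule, and `alpha` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def kmeans_prototypes(feats, k, iters=20, seed=0):
    """Group one class's unit-norm features into k prototypes (spherical k-means)."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(feats @ centers.T, axis=1)  # nearest center by cosine
        for j in range(k):
            members = feats[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    return centers

def build_prototypes(img_feats, labels, text_feats, num_classes, k):
    """Return visual prototypes (C, k, D) and textual prototypes (C, D)."""
    img_feats = l2_normalize(img_feats)
    visual = np.stack([
        kmeans_prototypes(img_feats[labels == c], k) for c in range(num_classes)
    ])
    textual = l2_normalize(text_feats)
    return visual, textual

def classify(query, visual, textual, alpha=0.5):
    """Fuse visual and textual similarities; alpha is a hypothetical fusion weight."""
    query = l2_normalize(query)
    # Visual score per class: best match among that class's prototypes (assumption).
    vis = np.einsum("nd,ckd->nck", query, visual).max(axis=2)  # (N, C)
    txt = query @ textual.T                                    # (N, C)
    return (alpha * vis + (1.0 - alpha) * txt).argmax(axis=1)

# Synthetic stand-ins for frozen CLIP embeddings: 3 classes, 16 dimensions.
rng = np.random.default_rng(0)
num_classes, dim, k = 3, 16, 2
class_dirs = 5.0 * np.eye(num_classes, dim)
train_labels = np.repeat(np.arange(num_classes), 10)
train_feats = class_dirs[train_labels] + 0.1 * rng.normal(size=(30, dim))
text_feats = class_dirs + 0.1 * rng.normal(size=(num_classes, dim))
test_labels = np.repeat(np.arange(num_classes), 5)
test_feats = class_dirs[test_labels] + 0.1 * rng.normal(size=(15, dim))

visual, textual = build_prototypes(train_feats, train_labels, text_feats, num_classes, k)
preds = classify(test_feats, visual, textual, alpha=0.5)
print("accuracy on synthetic data:", (preds == test_labels).mean())
```

Setting `alpha=0.0` uses only the textual prototypes, which mirrors a training-free setting where no labelled images are available. Larger values lean on the visual prototypes built from training images.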

Paper Structure (21 sections, 8 equations, 7 figures, 4 tables, 1 algorithm)

Introduction
Related work
Plant Disease Classification
Vision-Language Modeling
Prompt Design with Large Language Models
Methodology
Proposed PlantWild Dataset
Data Collection and Filtering
Textual Prompt Generation
Practicality and Necessity of PlantWild
Multimodal Versatile Plant Disease Recognition Baseline
Prototype Construction
Multimodal Prototype Learning
Inference
Experiments
...and 6 more sections

Figures (7)

Figure 1: Left: Illustration of intra-class variances and inter-class discrepancies among plant disease images. Right: statistics of existing plant disease datasets. The marker size corresponds to the number of plant diseases. Our proposed PlantWild dataset not only encompasses the largest number of disease classes but also includes the highest volume of in-the-wild images.
Figure 2: The curation process of our PlantWild dataset. Before annotation, the collected data is very noisy and contains many irrelevant images. After annotation, PlantWild consists of in-the-wild disease-relevant images and text descriptions for each class.
Figure 3: Statistics of our PlantWild dataset. It contains 56 diseased and 33 healthy classes. The number of images per class ranges from 44 to 589.
Figure 4: Overall architecture of our baseline. CLIP encoders extract features from images and text for each category and then multiple prototypes are constructed by grouping visual features. Given a test image, both the visual and textual prototypes can be used for classification.
Figure 5: Similarity maps of zero-shot CLIP features and prototypes across 10 plant diseases: (a) bean halo blight; (b) cucumber angular leaf spot; (c) garlic rust; (d) corn smut; (e) apple scab; (f) bean rust; (g) coffee leaf rust; (h) maple tar spot; (i) plum pocket disease; (j) strawberry leaf scorch.
...and 2 more figures
