FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach

Anju Rani, Daniel O. Arroyo, Petar Durdevic

TL;DR

The paper tackles zero-shot fungal classification in vision–language models by addressing data scarcity with two complementary sources: LLM-generated textual descriptions of fungal growth stages and a synthetic image dataset representing fine-grained growth stages (spore, hyphae, mycelium). The authors align these modalities in CLIP's shared representation space and fine-tune CLIP using a cross-modal contrastive loss $L_{total}=L_{image}+L_{text}$, exploring multiple transformer architectures. Key findings show that ViT-L/14@336px achieves Recall@1 ≈ 0.97 on the synthetic dataset and that GPT-4o1-derived captions offer the best alignment among the tested LLMs, while hyphae and mycelium remain the most challenging pair due to visual overlap. The approach demonstrates the value of combining synthetic data with domain-specific, LLM-generated text for fine-grained, zero-shot fungal classification, with practical implications for automated fungal identification and monitoring. Future work includes expanding the set of growth stages, enriching text embeddings, and exploring adaptive fine-tuning strategies.
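To make the loss $L_{total}=L_{image}+L_{text}$ concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a batch of paired embeddings. It assumes each term is the usual CLIP-style cross-entropy over cosine-similarity logits, which this summary does not spell out. The function names, the temperature value of 0.07, and the toy vectors in the usage note are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Return (L_total, L_image, L_text) for paired image/text embeddings.

    Assumption: temperature=0.07 is a placeholder, not a value from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    loss_image = cross_entropy_diag(logits)  # image -> text direction
    transposed = [list(col) for col in zip(*logits)]
    loss_text = cross_entropy_diag(transposed)  # text -> image direction
    return loss_image + loss_text, loss_image, loss_text
```

Calling `contrastive_loss(image_embs, text_embs)` with correctly paired embeddings should give a smaller total than calling it with the captions shuffled, which is the behaviour that fine-tuning exploits.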

Abstract

The effectiveness of zero-shot classification in large vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), depends on access to extensive, well-aligned text-image datasets. In this work, we introduce two complementary data sources, one generated by large language models (LLMs) to describe the stages of fungal growth and another comprising a diverse set of synthetic fungi images. These datasets are designed to enhance CLIP's zero-shot classification capabilities for fungi-related tasks. To ensure effective alignment between text and image data, we project them into CLIP's shared representation space, focusing on different fungal growth stages. We generate text using LLaMA3.2 to bridge modality gaps and synthetically create fungi images. Furthermore, we investigate knowledge transfer by comparing text outputs from different LLM techniques to refine classification across growth stages.
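Zero-shot classification in a shared representation space can be sketched as follows: embed one caption per growth stage, embed the image, and pick the stage whose caption embedding is most similar to the image embedding. The sketch below assumes the embeddings have already been produced by some encoder. The helper names and the three-dimensional toy vectors are hypothetical and stand in for real CLIP outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_text_embs):
    """Return (best_label, scores) given one image embedding and a
    dict mapping each class label to its caption embedding."""
    scores = {
        label: cosine(image_emb, text_emb)
        for label, text_emb in class_text_embs.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (assumed values, for illustration only).
class_text_embs = {
    "spore": [1.0, 0.1, 0.0],
    "hyphae": [0.1, 1.0, 0.4],
    "mycelium": [0.0, 0.5, 1.0],
}
label, scores = zero_shot_classify([0.9, 0.2, 0.1], class_text_embs)
```

No classifier head is trained here: swapping in different captions, for example ones written by a different LLM, changes the predictions only through the text embeddings.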

Paper Structure

This paper contains 7 sections, 6 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the adapted CLIP model.
  • Figure 2: Fungi Dataset: (a) Spore, (b) Hyphae, and (c) Mycelium.
  • Figure 3: Textual descriptions generated using various LLMs.
  • Figure 4: Zero-shot classification with proposed CLIP model: (a) top: GPT-4o1, (b) centre: Claude 3.5, and (c) bottom: Gemini 2.0.
  • Figure 5: Recall@1 qualitative results. The figure shows the predicted and true class labels from the validation dataset. Results correctly ranked first for their class label are bordered in green; incorrect results are bordered in red.
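Recall@1, the metric reported in the figures above, is the fraction of queries whose top-ranked candidate carries the correct label. The following is a minimal sketch assuming a precomputed image-to-caption similarity matrix. The function name and the similarity values are made up for illustration and are not results from the paper.

```python
def recall_at_1(similarity, true_labels, candidate_labels):
    """Fraction of queries whose highest-scoring candidate has the true label.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for row, truth in zip(similarity, true_labels):
        best = max(range(len(row)), key=row.__getitem__)  # index of top score
        if candidate_labels[best] == truth:
            hits += 1
    return hits / len(true_labels)


# Illustrative scores for four images against three stage captions.
candidate_labels = ["spore", "hyphae", "mycelium"]
similarity = [
    [0.9, 0.2, 0.1],  # predicted spore
    [0.1, 0.6, 0.5],  # predicted hyphae
    [0.2, 0.7, 0.6],  # predicted hyphae (true label is mycelium)
    [0.1, 0.3, 0.8],  # predicted mycelium
]
true_labels = ["spore", "hyphae", "mycelium", "mycelium"]
score = recall_at_1(similarity, true_labels, candidate_labels)
```

In this toy example the single miss is a mycelium image ranked as hyphae, the kind of confusion the summary identifies as the hardest case.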