FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach
Anju Rani, Daniel O. Arroyo, Petar Durdevic
TL;DR
The paper addresses data scarcity in zero-shot fungal classification with vision–language models by combining two complementary sources: LLM-generated textual descriptions of fungal growth stages and a synthetic image dataset covering three fine-grained stages (spore, hyphae, mycelium). The authors align the two modalities in CLIP's shared representation space and fine-tune CLIP with a cross-modal contrastive loss $L_{total}=L_{image}+L_{text}$ across multiple transformer architectures. ViT-L/14@336px achieves Recall@1 ≈ 0.97 on the synthetic dataset, and GPT-4o1-derived captions give the best alignment among the tested LLMs; hyphae and mycelium remain the hardest pair to separate because of their visual overlap. The results suggest that synthetic images combined with domain-specific, LLM-generated text improve fine-grained zero-shot fungal classification, with practical implications for automated fungal identification and monitoring. Future work includes adding growth stages, enriching text embeddings, and exploring adaptive fine-tuning strategies.
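As a rough illustration of the loss $L_{total}=L_{image}+L_{text}$, the sketch below computes a symmetric contrastive objective over a small batch of paired embeddings. The summary does not spell out the form of each term, so this assumes the usual CLIP-style formulation: $L_{image}$ is the cross-entropy of image-to-text similarities and $L_{text}$ the cross-entropy of text-to-image similarities, with matching pairs on the diagonal. The function names, the temperature of 0.07, and the toy vectors are illustrative assumptions, not values from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Return (L_total, L_image, L_text) for a batch of paired embeddings.

    Assumption: the i-th image and the i-th text form the positive pair.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    loss_image = cross_entropy_diag(logits)
    loss_text = cross_entropy_diag(logits_t)
    return loss_image + loss_text, loss_image, loss_text

# Toy batch: two image embeddings and two caption embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # captions match their images
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # captions paired with the wrong image

aligned_total, _, _ = contrastive_loss(images, aligned_texts)
swapped_total, _, _ = contrastive_loss(images, swapped_texts)
```

With the toy batch, the correctly paired captions yield a loss near zero while the swapped captions yield a large one, which is the signal fine-tuning uses to pull matching image and text embeddings together.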
Abstract
The effectiveness of zero-shot classification in large vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), depends on access to extensive, well-aligned text-image datasets. In this work, we introduce two complementary data sources: one generated by large language models (LLMs) to describe the stages of fungal growth, and another comprising a diverse set of synthetic fungi images. These datasets are designed to enhance CLIP's zero-shot classification capabilities for fungi-related tasks. To ensure effective alignment between text and image data, we project them into CLIP's shared representation space, focusing on different fungal growth stages. We generate text using LLaMA3.2 to bridge modality gaps and synthetically create fungi images. Furthermore, we investigate knowledge transfer by comparing text outputs from different LLM techniques to refine classification across growth stages.
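To make the zero-shot step concrete, the sketch below classifies an image by comparing its embedding with one caption embedding per growth stage in the shared space and picking the most similar stage. It is a minimal illustration under stated assumptions: the three-dimensional vectors stand in for real CLIP embeddings, the helper names are hypothetical, and Recall@1 is computed here as the fraction of images whose top-ranked stage caption is the correct one, which may differ from the paper's exact evaluation protocol.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_predict(image_emb, class_text_embs):
    """Return the label whose caption embedding is closest to the image."""
    scores = {label: cosine(image_emb, emb) for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

def recall_at_1(image_embs, true_labels, class_text_embs):
    """Fraction of images whose top-ranked caption carries the true label."""
    hits = sum(
        zero_shot_predict(emb, class_text_embs)[0] == label
        for emb, label in zip(image_embs, true_labels)
    )
    return hits / len(true_labels)

# Hypothetical caption embeddings, one per growth stage. The hyphae and
# mycelium vectors are deliberately close to mimic their visual overlap.
stage_text_embs = {
    "spore": [1.0, 0.0, 0.0],
    "hyphae": [0.0, 1.0, 0.2],
    "mycelium": [0.0, 0.8, 1.0],
}

# Hypothetical image embeddings with ground-truth stages.
image_embs = [
    [0.9, 0.1, 0.0],  # clearly spore-like
    [0.1, 1.0, 0.1],  # clearly hyphae-like
    [0.0, 0.7, 1.0],  # clearly mycelium-like
    [0.0, 1.0, 0.5],  # mycelium image that sits nearer the hyphae caption
]
true_labels = ["spore", "hyphae", "mycelium", "mycelium"]

first_label, first_scores = zero_shot_predict(image_embs[0], stage_text_embs)
score = recall_at_1(image_embs, true_labels, stage_text_embs)
```

The last toy image is labelled mycelium but lands closer to the hyphae caption, so it is counted as a miss; this mirrors the kind of confusion between overlapping stages that better-aligned captions are meant to reduce.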
