μ-Bench: A Vision-Language Benchmark for Microscopy Understanding

Alejandro Lozano; Jeffrey Nirschl; James Burgess; Sanket Rajan Gupte; Yuhui Zhang; Alyssa Unell; Serena Yeung-Levy

TL;DR

μ-Bench addresses the lack of diverse, large-scale vision-language benchmarks for microscopy by introducing a benchmark with 17,235 images across 22 perception and cognition tasks and multiple microscopy modalities. The authors evaluate generalist and specialist VLMs, revealing substantial remaining limitations and a tendency for fine-tuning to cause catastrophic forgetting; they show that weight interpolation between base and fine-tuned models can mitigate forgetting and improve performance. The benchmark highlights how model design choices and data composition influence microscopy understanding and demonstrates the potential of ensemble-like weight merging to achieve robust, cross-task performance. By releasing μ-Bench under a permissive license, the work provides a practical, scalable platform to advance microscopy foundation models and guide future research in biomedical VLMs.

Abstract

Recent advances in microscopy have enabled the rapid generation of terabytes of image data in cell biology and biomedical research. Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis, enhancing researchers' efficiency, identifying new image biomarkers, and accelerating hypothesis generation and scientific discovery. However, there is a lack of standardized, diverse, and large-scale vision-language benchmarks to evaluate VLMs' perception and cognition capabilities in biological image understanding. To address this gap, we introduce μ-Bench, an expert-curated benchmark encompassing 22 biomedical tasks across various scientific disciplines (biology, pathology), microscopy modalities (electron, fluorescence, light), scales (subcellular, cellular, tissue), and organisms in both normal and abnormal states. We evaluate state-of-the-art biomedical, pathology, and general VLMs on μ-Bench and find that: i) current models struggle on all categories, even for basic tasks such as distinguishing microscopy modalities; ii) current specialist models fine-tuned on biomedical data often perform worse than generalist models; iii) fine-tuning in specific microscopy domains can cause catastrophic forgetting, eroding prior biomedical knowledge encoded in the base model; and iv) weight interpolation between fine-tuned and pre-trained models offers one solution to forgetting and improves general performance across biomedical tasks. We release μ-Bench under a permissive license to accelerate the research and development of microscopy foundation models.
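To make finding iv) concrete, the following is a minimal sketch of weight interpolation between a pre-trained model and its fine-tuned counterpart. It assumes a simple linear blend with a single coefficient `alpha` applied to every parameter; the abstract does not give the exact formula, so the blend rule, the `interpolate_weights` helper, and the toy parameter names are illustrative assumptions rather than the authors' implementation. Weights are represented as plain dictionaries of float lists to keep the example dependency-free.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Blend two checkpoints as (1 - alpha) * base + alpha * finetuned.

    alpha = 0 returns the base (pre-trained) weights and alpha = 1 returns
    the fine-tuned weights. The linear rule is an assumption for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("models must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        ft_values = finetuned[name]
        if len(base_values) != len(ft_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two checkpoints.
        merged[name] = [
            (1.0 - alpha) * b + alpha * f for b, f in zip(base_values, ft_values)
        ]
    return merged


# Toy checkpoints (made-up values, not real model weights).
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
finetuned = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}

merged = interpolate_weights(base, finetuned, alpha=0.5)
print(merged)
```

In practice the same loop would run over the tensors of two checkpoints that share an architecture, and `alpha` would be chosen on held-out data to trade off specialist and general performance.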

Paper Structure (46 sections, 2 equations, 20 figures, 18 tables)

Introduction
Related Work
Dataset collection methodology
Perception Dataset Curation
Cognitive Dataset Curation
Cognition Dataset Collection
Dataset Description
VLM benchmarking and results
Benchmarking approach
Generalist Contrastive (GC) VLMs
Generalist autoregressive (GA) VLMs
Specialist contrastive (SC) VLMs
Evaluation
Results
All models have high error rates
...and 31 more sections

Figures (20)

Figure 1: Data samples from $\mu$-Bench, covering perception (left) and cognition (right) tasks at the subcellular, cellular, and tissue levels across electron, fluorescence, and light microscopy.
Figure 2: $\mu$-Bench construction protocol. Perception dataset (left): first taxonomize use cases across subcellular, cellular, and tissue-level applications and collect representative datasets spanning multiple imaging modalities to test those scenarios. Next, datasets are converted to a common format, and the ontological information extracted from their metadata is standardized. Aided by this information, experts synthesize VQA pairs designed to test perception ability. Cognition dataset (right): First, domain experts use an interactive web application to upload their images and corresponding open-ended VQA pairs. Next, GPT-4 transforms the VQA pairs into a close-ended multiple-choice format. All GPT-4 generations are reviewed by experts before being incorporated into the cognition dataset.
Figure 3: $\mu$-Bench Perception dataset statistics. The Perception benchmark consists of microscopy images from 12 subdomains in Biology and Pathology, obtained using 8 different imaging techniques, including light, fluorescence, and electron microscopy. It includes 17 fine-grained perception tasks: 13 for classification and 4 for segmentation or object detection.
Figure 4: Performance comparison on the perception benchmark for the best-performing general domain auto-regressive model (GPT-4o), contrastive model (ALIGN), specialist biomedical contrastive model (BiomedCLIP), and specialist pathology contrastive model (CONCH). The top row shows performance on all of $\mu$-Bench, while the bottom row shows pathology-only samples.
Figure 5: Fine-tuning and microscopy perception generalization on $\mu$-Bench. Base CLIP models (blue) are fine-tuned to PLIP and QuiltNet using pathology data mixtures (pink). Weight-merging base models with their corresponding fine-tuned models (olive) improves specialist zero-shot performance on $\mu$-Bench coarse-grained (left) and fine-grained (right) perception.
...and 15 more figures
