In-Context Adaptation of VLMs for Few-Shot Cell Detection in Optical Microscopy

Shreyan Ganguly, Angona Biswas, Jaydeep Rade, Md Hasibul Hasan Hasib, Nabila Masud, Nitish Singla, Abhipsa Dash, Ushashi Bhattacharjee, Aditya Balu, Anwesha Sarkar, Adarsh Krishnamurthy, Soumik Sarkar

TL;DR

This work examines how well vision-language foundation models transfer to few-shot cell detection in biomedical microscopy, using the Micro-OD benchmark to quantify both the domain shift and the benefit of in-context adaptation. The benchmark comprises 252 annotated images covering 11 cell types from four sources. Eight VLMs are evaluated under zero-shot and few-shot regimes, alongside a hybrid FSOD pipeline that decouples localization from classification. Zero-shot performance is weak because of the domain gap, while few-shot support improves detection, with diminishing returns after about six examples. The benefit of implicit reasoning tokens is task-dependent: they help end-to-end localization but not always the classification of pre-localized crops. Micro-OD provides a reproducible platform for advancing open-vocabulary microscopy detection and for guiding future model architectures and prompting strategies in biomedical imaging.

Abstract

Foundation vision-language models (VLMs) excel on natural images, but their utility for biomedical microscopy remains underexplored. In this paper, we investigate how in-context learning enables state-of-the-art VLMs to perform few-shot object detection when large annotated datasets are unavailable, as is often the case with microscopy images. We introduce the Micro-OD benchmark, a collection of 252 images curated for in-context learning, with bounding-box annotations spanning 11 cell types across four sources, including two in-lab expert-annotated sets. We systematically evaluate eight VLMs under few-shot conditions and compare variants with and without implicit test-time reasoning tokens. We further implement a hybrid Few-Shot Object Detection (FSOD) pipeline that combines a detection head with a VLM-based few-shot classifier, which enhances the few-shot performance of recent VLMs on our benchmark. Across datasets, we observe that zero-shot performance is weak due to the domain gap; however, few-shot support consistently improves detection, with only marginal gains beyond six shots. We observe that models with reasoning tokens are more effective for end-to-end localization, whereas simpler variants are more suitable for classifying pre-localized crops. Our results highlight in-context adaptation as a practical path for microscopy, and our benchmark provides a reproducible testbed for advancing open-vocabulary detection in biomedical imaging.

Paper Structure

This paper contains 10 sections, 3 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: An overview of the Few-Shot Object Detection (FSOD) pipeline using vision-language models (VLMs) for cell shape detection. Images from four microscopy datasets (BBBC, BCCD, Live Cells, and NIH-3T3) are passed to various VLMs with either a text-only or a combined text-and-image prompt to generate bounding-box detections for different cell types.
  • Figure 2: The four experimental configurations used to evaluate Vision-Language Models (VLMs) on the Micro-OD benchmark. This figure illustrates the distinct setups: Zero Shot-T (text-only prompt), Few Shot-V (visual-only prompts), Few Shot-MMO (multi-modal few-shot detection), and Few Shot-MMC (a cascaded pipeline of localization with SAM followed by few-shot classification with VLMs).
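
The cascaded Few Shot-MMC configuration described in Figure 2 separates localization from classification: a localizer proposes boxes, and each cropped region is then labeled using a handful of support examples. The control flow can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `cascaded_detect`, `toy_localizer`, and `nearest_mean_classifier` are hypothetical names, the localizer stands in for SAM, and the nearest-mean-intensity rule stands in for a VLM prompted with few-shot examples.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max), max exclusive
Image = List[List[float]]            # toy grayscale image: rows of pixel values
Support = List[Tuple[Image, str]]    # few-shot examples: (crop, class label)


@dataclass
class Detection:
    box: Box
    label: str


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of `image`."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def mean_intensity(image: Image) -> float:
    """Average pixel value of a non-empty image."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return total / count


def toy_localizer(image: Image) -> List[Box]:
    """Hypothetical stand-in for a class-agnostic localizer such as SAM:
    it simply splits the image into a left half and a right half."""
    height, width = len(image), len(image[0])
    half = width // 2
    return [(0, 0, half, height), (half, 0, width, height)]


def nearest_mean_classifier(region: Image, support: Support) -> str:
    """Hypothetical stand-in for a few-shot VLM classifier: return the label
    of the support crop whose mean intensity is closest to the region's."""
    target = mean_intensity(region)
    _, label = min(support, key=lambda ex: abs(mean_intensity(ex[0]) - target))
    return label


def cascaded_detect(
    image: Image,
    localize: Callable[[Image], List[Box]],
    classify: Callable[[Image, Support], str],
    support: Support,
) -> List[Detection]:
    """Stage 1 proposes boxes; stage 2 labels each crop from the support set."""
    return [
        Detection(box, classify(crop(image, box), support))
        for box in localize(image)
    ]
```

Because the two stages only communicate through boxes and crops, either one can be swapped independently, for example replacing the toy classifier with a call to a VLM while keeping the same localizer.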