Automatic Image-Level Morphological Trait Annotation for Organismal Images

Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su

Abstract

Morphological traits are physical characteristics of biological organisms that provide vital clues about how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
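
The pipeline's first stage, dense feature extraction with an off-the-shelf backbone, is concrete enough to sketch. Below is a minimal example, assuming a DINOv2 ViT-B/14 backbone loaded via torch.hub; the model variant, the 224x224 input size, and the random placeholder image are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

# Hedged sketch of the feature-extraction stage: dense patch tokens from an
# off-the-shelf DINOv2 backbone. Model variant and input size are assumptions.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

# A 224x224 RGB image yields a 16x16 grid of 14x14 patches (224 / 14 = 16).
img = torch.randn(1, 3, 224, 224)  # placeholder for a normalized specimen image
with torch.no_grad():
    out = model.forward_features(img)
patch_feats = out["x_norm_patchtokens"]  # (1, 256, 768) for ViT-B/14
print(patch_feats.shape)
```

These per-patch features are what the pre-trained SAE consumes; the sketch after the figure list below continues from this point.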

Paper Structure

This paper contains 35 sections, 3 equations, 25 figures, 13 tables, and 1 algorithm.

Figures (25)

  • Figure 1: Given an input specimen image, we first compute dense visual representations using an off-the-shelf backbone (e.g., DINOv2). These features are passed through a pre-trained sparse autoencoder (SAE), which identifies high-activation latent units corresponding to semantically meaningful regions (Algorithm \ref{alg:trait-extraction}; a minimal code sketch of this latent-selection step follows the figure list). We extract the spatial masks associated with these activations and overlay them on the original image to localize trait-relevant boxes. Finally, a multimodal large language model (MLLM) is prompted with the annotated image to generate fine-grained morphological trait descriptions. This results in a large-scale, automatically labeled image-level trait dataset.
  • Figure 2: Comparison of trait localization for Thymoites guanicae. Bioscan-Traits (left) generates interpretable trait descriptions that are tied to clear, specific anatomical structures. In contrast, Grad-CAM (center) produces diffuse heatmaps that highlight broad body areas without species-level disentanglement.
  • Figure 3: Comparison of salient morphological trait description generation using just an MLLM vs. an MLLM + SAE ($t_{\text{freq}} = 10^{-2}$) for Agyneta straminicola. Each red box highlights a region selected by SAE neurons with high activation, indicating regions used for prompting the MLLM + SAE. Using the SAE helps the MLLM focus on salient morphological traits rather than generic descriptions of all body parts.
  • Figure 4: Comparison of salient morphological trait description generation using a single image vs. three images for Contacyphon ochraceus. Each red box highlights a region selected by SAE neurons with high activation, indicating regions used for prompting the MLLM + SAE. The use of multiple images yields a concise and taxonomically meaningful output, isolating traits with clearer morphological grounding.
  • Figure 5: Neurons 4852 and 13860 in the SAE activate on the wings and antennae of insects, respectively. The labels denote the highest annotated taxonomic level. Additional examples are shown in Appendix \ref{sec:add_neuron_act}.
  • ...and 20 more figures
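
Algorithm \ref{alg:trait-extraction} itself is not reproduced on this page, but the latent-selection step described in the Figure 1 and Figure 3 captions can be sketched as follows: score SAE latents by how often they fire across the patch tokens, keep those that clear the frequency threshold $t_{\text{freq}}$, and turn each surviving latent's activation mask into a bounding box. Everything below (the ReLU-encoder SAE, all dimensions, top_k, the random stand-in features, and the reading of $t_{\text{freq}}$ as a per-image firing-frequency floor) is a hypothetical reconstruction under those assumptions, not the authors' released code.

```python
import torch

# Hypothetical reconstruction of the latent-selection step: a 16x16 grid of
# patch tokens, 768-dim features, and an SAE with 16384 latents. All sizes,
# names, and the interpretation of t_freq are assumptions for illustration.
GRID, D_FEAT, D_LATENT = 16, 768, 16384
T_FREQ = 1e-2  # keep latents firing on at least this fraction of patches

def sae_encode(feats, w_enc, b_enc):
    """ReLU encoder of a standard sparse autoencoder: z = relu(x @ W + b)."""
    return torch.relu(feats @ w_enc + b_enc)

def trait_boxes(patch_feats, w_enc, b_enc, top_k=5):
    """Return one bounding box (in patch coordinates) per selected latent."""
    z = sae_encode(patch_feats, w_enc, b_enc)   # (GRID*GRID, D_LATENT)
    freq = (z > 0).float().mean(dim=0)          # firing frequency per latent
    keep = (freq >= T_FREQ).nonzero(as_tuple=True)[0]
    # Rank surviving latents by total activation mass and keep the top_k.
    order = z[:, keep].sum(dim=0).argsort(descending=True)[:top_k]
    boxes = []
    for latent in keep[order]:
        mask = (z[:, latent] > 0).reshape(GRID, GRID)
        ys, xs = mask.nonzero(as_tuple=True)
        boxes.append((xs.min().item(), ys.min().item(),
                      xs.max().item(), ys.max().item()))
    return boxes

# Toy run: random features and weights stand in for DINOv2 tokens and a
# trained SAE; real usage would load both from the actual pipeline.
feats = torch.randn(GRID * GRID, D_FEAT)
w_enc = torch.randn(D_FEAT, D_LATENT) * 0.02
b_enc = torch.zeros(D_LATENT)
print(trait_boxes(feats, w_enc, b_enc))
```

In the full pipeline these boxes would then be scaled from patch to pixel coordinates and drawn on the image, yielding the red boxes of Figures 3 and 4, before the annotated image is passed to the MLLM for trait description.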