CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

ZeMing Gong, Austin T. Wang, Xiaoliang Huo, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang

TL;DR

CLIBD presents a tripartite contrastive learning framework that fuses images, DNA barcodes, and taxonomic text into a unified embedding space to advance scalable, zero-shot biodiversity monitoring. By aligning three modalities, the model improves fine-grained taxonomic classification and enables cross-modal retrieval, outperforming prior image-text-only approaches such as BioCLIP. Experiments on BIOSCAN-1M (and INSECT) show notable gains at genus and species levels, with DNA barcodes providing a particularly effective alignment target. The work demonstrates practical routes for open-set biodiversity identification and cross-modal queries, and also discusses cost and deployment considerations in real-world biomonitoring workflows. Together, these results suggest that DNA-guided multimodal representations can significantly enhance scalable biodiversity analysis beyond insects alone.

Abstract

Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse barcode DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
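
The alignment idea in the abstract can be illustrated with a minimal sketch of a CLIP-style contrastive objective over three modalities. The abstract only states that images, barcode DNA, and text are aligned with CLIP-style contrastive learning; the specific choices below (a symmetric InfoNCE loss, an unweighted sum over the three modality pairs, a temperature of 0.07, and the names `info_nce` and `tri_modal_loss`) are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric CLIP-style loss between two batches of paired embeddings.

    a[i] and b[i] are assumed to come from the same specimen (a positive pair);
    every other combination in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # logits[i][j] = cosine similarity of a[i] and b[j], scaled by the temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(z - m) for z in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def tri_modal_loss(image, dna, text, temperature=0.07):
    """Assumed combination: unweighted sum of the three pairwise losses."""
    return (
        info_nce(image, dna, temperature)
        + info_nce(image, text, temperature)
        + info_nce(dna, text, temperature)
    )


# Toy embeddings (hypothetical values): two specimens, two dimensions.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = tri_modal_loss(aligned, aligned, aligned)
loss_misaligned = tri_modal_loss(aligned, swapped, aligned)
```

In this toy case the loss is close to zero when all three modalities agree on each specimen, and it grows when the DNA embeddings are swapped between specimens, which is the behavior a contrastive alignment objective is meant to have.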

Paper Structure (32 sections, 2 equations, 15 figures, 22 tables)

Figures (15)

  • Figure 1: Overview of CLIBD. (a) Our model consists of three encoders for processing images, DNA barcodes, and text. During training, we use a contrastive loss to align the image, DNA, and text embeddings. (b) At inference, we embed a query image and match it to a database of existing image and DNA embeddings (keys). We use cosine similarity to find the closest key embedding and use its taxonomic label to classify the query.
  • Figure 2: Data partitioning. We split the BIOSCAN-1M data into training, validation, and test partitions. The training set (used for contrastive learning) has records without any species labels as well as a set of seen species that are well-represented (at least 9 records per species). The validation and test sets include seen and unseen (not seen during training) species. These images are further split into subpartitions of queries (darker color) and keys (lighter color) for evaluation. We ensure that the validation and test sets have different unseen species. Since the seen species are common, we have a shared set of records (Key) that we use as keys for seen species and combine all key sets to form a reference database. We show the number of records in each box.
  • Figure 3: Example query-key pairs. Top-3 nearest specimens from the unseen validation-key dataset, retrieved based on cosine similarity for DNA-to-DNA, image-to-image, and image-to-DNA retrieval. Box color indicates whether the retrieved sample had the same species (green), genus (light green), family (yellow), or order (orange) as the query, or was not matched at any of these levels (red).
  • Figure 4: Average top-1 per-species accuracy, binned by count of species records in the key set, for different query and key combinations. We show seen species in blue and unseen species in orange, with size and color intensity indicating the number of binned species. In all cases, the accuracy for seen species increases with the number of records. While the trend is similar for unseen species with intra-modal retrieval (panels a and d), cross-modal retrieval (panels b and c) achieves much lower performance, underscoring the challenge of cross-modal alignment.
  • Figure 5: We visualize the attention for queries from seen and unseen species. The "Before" and "After" columns indicate whether the species-level prediction was correct before alignment (initial unaligned model) and after alignment (I+D+T model). The predicted genus and species are shown in green when correct and in red when incorrect. Only a few samples were predicted correctly before alignment and incorrectly after it (38 for seen and 69 for unseen).
  • ...and 10 more figures
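
The inference step described in the Figure 1 caption (embed a query, find the closest key by cosine similarity, and reuse that key's taxonomic label) and the per-species top-1 accuracy reported in Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy two-dimensional embeddings, and the labels `species_a` and `species_b` are all hypothetical, and a real system would use encoder outputs and an efficient nearest-neighbor index rather than a linear scan.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def classify_by_nearest_key(query, keys, key_labels):
    """Return the label of the key embedding most similar to the query."""
    best = max(range(len(keys)), key=lambda i: cosine_similarity(query, keys[i]))
    return key_labels[best]


def per_species_top1_accuracy(queries, query_labels, keys, key_labels):
    """Top-1 accuracy computed per species, then averaged over species.

    Averaging over species (rather than over records) keeps common species
    from dominating the score.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for query, label in zip(queries, query_labels):
        counts[label] += 1
        predicted = classify_by_nearest_key(query, keys, key_labels)
        hits[label] += int(predicted == label)
    per_species = {species: hits[species] / counts[species] for species in counts}
    average = sum(per_species.values()) / len(per_species)
    return average, per_species


# Hypothetical key database (image or DNA embeddings) with species labels.
keys = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]
key_labels = ["species_a", "species_b", "species_a"]

# Hypothetical query embeddings with their true labels.
queries = [[0.8, 0.0], [0.0, 0.7], [0.6, 0.7]]
query_labels = ["species_a", "species_b", "species_a"]

average_accuracy, accuracy_by_species = per_species_top1_accuracy(
    queries, query_labels, keys, key_labels
)
```

Because the keys can hold image embeddings, DNA embeddings, or both, the same lookup covers the intra-modal and cross-modal settings mentioned in the captions; only the encoder used to produce the keys changes.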