Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference

Sk Miraj Ahmed, Xi Yu, Yunqi Li, Yuewei Lin, Wei Xu

Abstract

Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction: inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLIBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLIBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key to practical biodiversity foundation models.

Paper Structure

This paper contains 22 sections, 13 equations, 2 figures, 2 tables, and 2 algorithms.

Figures (2)

  • Figure 1: Effect of hierarchical regularization on embedding geometry and noise robustness (CLIBD-HiR, variant 1). Left: without the HiR loss, standard contrastive training treats mismatched taxa uniformly, yielding no explicit geometric relationship between intra-genus distances ($d_1$; different species within the same genus) and inter-genus / higher-level distances ($d_2, d_3$). Under realistic noise, a perturbed query embedding may drift across arbitrary clusters, leading to errors that propagate to higher taxonomic ranks. Right: with HiR, the loss explicitly enforces a hierarchy-consistent structure ($d_1 < d_2 < d_3$), so nearby neighborhoods reflect taxonomic proximity. Consequently, even when noise causes a species-level mistake, predictions are more likely to remain correct at coarser levels (genus/family/order), improving robustness (a code sketch of such a hierarchy-margin loss follows this list).
  • Figure 2: CLIBD-HiR-Fuse framework (Algorithm variant 2). Given a specimen image and its DNA barcode, we encode each modality with an image encoder and a DNA encoder, and embed the taxonomy prompt with a frozen BioCLIP text encoder. We align image–text and DNA–text representations using CLIP-style contrastive learning, and enforce hierarchy-aware structure with a hierarchical loss over augmented image views. A lightweight GatedFusion module adaptively combines image and DNA embeddings into a fused representation, which is additionally aligned to the fixed text embedding space via a fused-to-text contrastive objective (a sketch of such a gated fusion head also follows this list).
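
The exact HiR objective is not reproduced on this page. As a rough, hedged sketch of how the ordering $d_1 < d_2 < d_3$ from Figure 1 could be enforced, the following PyTorch function applies hinge margins between the mean pairwise embedding distances of pairs that first disagree at each taxonomic rank; the function name, the margin value, and the coarse-to-fine label layout are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def hierarchy_margin_loss(z, ranks, margin=0.1):
        # z:     (B, D) embeddings.
        # ranks: (B, L) integer labels ordered coarse-to-fine,
        #        e.g. columns = (order, family, genus, species).
        # Hypothetical regularizer: mean pairwise distance should grow
        # as the deepest rank shared by two samples gets coarser.
        z = F.normalize(z, dim=-1)
        dist = torch.cdist(z, z)                         # (B, B) pairwise distances
        same = ranks.unsqueeze(0) == ranks.unsqueeze(1)  # (B, B, L) per-rank agreement

        # Mean distance per "first disagreement" level, finest split first:
        # same genus / different species (d1), then same family /
        # different genus (d2), and so on up the taxonomy.
        means = []
        for l in range(ranks.shape[1] - 1, 0, -1):
            mask = same[..., l - 1] & ~same[..., l]
            means.append(dist[mask].mean() if mask.any() else None)

        # Hinge each finer-level mean below the next coarser one.
        loss, prev = z.new_zeros(()), None
        for m in means:
            if m is None:
                continue
            if prev is not None:
                loss = loss + F.relu(prev - m + margin)
            prev = m
        return loss

In training, such a term would be added, with a weighting coefficient, to the CLIP-style image–text and DNA–text contrastive losses described above.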
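Similarly, the GatedFusion module in Figure 2 is described only as a lightweight, adaptive combiner that also supports single-modality inference. A minimal sketch consistent with that description, with illustrative names and sizes, is a per-dimension sigmoid gate over the concatenated embeddings:

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        # Hypothetical gated fusion head: a learned per-dimension gate
        # convexly combines the image and DNA embeddings, and either
        # modality alone passes through unchanged, so one module serves
        # image-only, DNA-only, and joint inference.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, z_img=None, z_dna=None):
            if z_img is None:                  # DNA-only inference
                return z_dna
            if z_dna is None:                  # image-only inference
                return z_img
            g = self.gate(torch.cat([z_img, z_dna], dim=-1))  # (B, dim), in (0, 1)
            return g * z_img + (1.0 - g) * z_dna

    # Usage: the fused embedding is then aligned to the frozen text
    # embedding space with the same contrastive objective.
    fuse = GatedFusion(dim=512)
    z_joint = fuse(torch.randn(4, 512), torch.randn(4, 512))
    z_image_only = fuse(z_img=torch.randn(4, 512))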