BarcodeMamba: State Space Models for Biodiversity Analysis
Tiancheng Gao, Graham W. Taylor
TL;DR
DNA barcode analysis for biodiversity must cope with large scale and with species unseen during training, especially in invertebrates. BarcodeMamba uses a Mamba-2 structured state space backbone with next-token prediction and flexible tokenization to model barcode sequences efficiently, outperforming BarcodeBERT with far fewer parameters. On a 1.5M Canadian invertebrate dataset, BarcodeMamba reaches 99.2% species-level accuracy in linear probing for seen species. When scaled to about 63.6% of BarcodeBERT's parameter count, it reaches 70.2% genus-level accuracy in 1-NN probing for unseen species, indicating strong generalization to unseen taxa and high parameter efficiency. These results suggest that SSM-based barcode models can be practical for large-scale biodiversity surveys and can support scalable discovery of new taxa. Future work includes extending to BIOSCAN-5M and exploring bi-directional variants.
Abstract
DNA barcodes are crucial in biodiversity analysis for building automatic identification systems that recognize known species and discover unseen species. Unlike human genome modeling, barcode-based invertebrate identification is challenging because of the vast diversity of species and the complexity of their taxonomy. Among Transformer-based foundation models, BarcodeBERT excelled in species-level identification of invertebrates, highlighting the effectiveness of self-supervised pretraining on barcode-specific datasets. More recently, structured state space models (SSMs) have emerged whose time complexity scales sub-quadratically with context length, making them an efficient alternative to attention-based architectures for sequence modeling. Given the success of Mamba and Mamba-2 in natural language, we designed BarcodeMamba, a performant and efficient foundation model for DNA barcodes in biodiversity analysis. We conducted a comprehensive ablation study on the impact of self-supervised training and tokenization methods, and compared both versions of Mamba layers in terms of expressiveness and their capacity to identify "unseen" species held back from training. Our study shows that BarcodeMamba outperforms BarcodeBERT even when using only 8.3% as many parameters, and reaches 99.2% species-level accuracy in linear probing without fine-tuning for "seen" species. In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT's parameters achieved 70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen species. The code repository to reproduce our experiments is available at https://github.com/bioscan-ml/BarcodeMamba.
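To make the 1-NN probing evaluation concrete, the following is a minimal sketch, not the code from the repository. It assumes that each barcode has already been mapped to a fixed-length embedding by a frozen pretrained model, and that an unseen-species query is assigned the genus of its most similar reference embedding. The use of cosine similarity, the function names, and the toy vectors and genus labels are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def one_nn_probe(
    reference_embeddings: Sequence[Sequence[float]],
    reference_genera: Sequence[str],
    query_embeddings: Sequence[Sequence[float]],
) -> List[str]:
    """Assign each query the genus label of its single nearest reference embedding."""
    predictions = []
    for query in query_embeddings:
        # Index of the reference embedding most similar to this query.
        best = max(
            range(len(reference_embeddings)),
            key=lambda i: cosine_similarity(query, reference_embeddings[i]),
        )
        predictions.append(reference_genera[best])
    return predictions


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy data (hypothetical): 2-D embeddings and placeholder genus names.
reference = [[1.0, 0.0], [0.0, 1.0]]
genera = ["genus_A", "genus_B"]
queries = [[0.9, 0.1], [0.2, 0.8]]

predicted = one_nn_probe(reference, genera, queries)
print(predicted, accuracy(predicted, ["genus_A", "genus_B"]))
```

Because the probe has no trainable parameters, its accuracy depends only on how well the frozen embeddings cluster by genus, which is why this kind of evaluation is suited to species that were held back from training.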
