TaxaBind: A Unified Embedding Space for Ecological Applications

Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ahmad, Nathan Jacobs

TL;DR

This work presents TaxaBind, a unified embedding space for characterizing any species of interest, and proposes multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality.

Abstract

We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite images, text, audio, and environmental features, useful for solving ecological problems. To learn this joint embedding space, we leverage ground-level images of species as a binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality. We construct two large datasets for pretraining: iSatNat with species images and satellite images, and iSoundNat with species images and audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-modal retrieval, and audio classification. The datasets and models are made available at https://github.com/mvrl/TaxaBind.
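To make the idea of retrieval in a shared embedding space concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each modality encoder has already produced a fixed-length vector and that similarity is measured with cosine similarity, which is a common choice for such spaces but is not stated in this abstract. The function names and the toy three-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_top_k(query, gallery, k=1):
    """Return indices of the k gallery embeddings most similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical embeddings: one ground-level image query and a tiny
# "satellite image" gallery. Real embeddings would come from trained encoders.
image_query = [0.9, 0.1, 0.0]
satellite_gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]
top = retrieve_top_k(image_query, satellite_gallery, k=2)
print(top)
```

Because every modality is embedded into the same space, the same ranking routine would apply to any query and gallery pair (for example, audio against text), provided the encoders were trained to align with the binding modality.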

Paper Structure

This paper contains 18 sections, 2 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: TaxaBind Framework. To create a unified embedding space consisting of different modalities, we exploit ground-level images of species as the binding modality. We use various ground-level image-paired datasets to train modality-specific encoders. Ultimately, the encoders support embedding arithmetic and exhibit emergent properties and zero-shot capabilities.
  • Figure 2: Multimodal Patching. For distilling unique information from different modalities, we patch the encoders using zero-shot classification with text. Note that since the network $f$ is shared across all modalities, it is patched using techniques like sequential patching or parallel patching.
  • Figure 3: Patching improves zero-shot classification performance with text. We evaluate the zero-shot classification accuracy of the ground-level image encoder with different values of $\alpha$ on iNat-2021. We observe performance improvements in all cases.
  • Figure 4: Species image to satellite image retrieval task. For each example, we show the top 4 most similar satellite images retrieved by our model from a gallery of 100k satellite images in the iSatNat-test set.
  • Figure 5: Zero-shot Species Distribution Map. We create a species distribution map of Cardinalis cardinalis using a query ground-level image and a combination of various modalities across the USA.
  • ...and 3 more figures
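
The captions for Figures 2 and 3 refer to patching an encoder with a coefficient $\alpha$, but this excerpt does not define how $\alpha$ is used. One common form of patching is linear interpolation between the weights of a model before and after fine-tuning. The sketch below assumes that form; the function name, the dictionary-of-lists weight format, and the toy values are hypothetical, and the paper's sequential and parallel patching variants are not reproduced here.

```python
def patch_weights(base, finetuned, alpha):
    """Interpolate two weight sets: (1 - alpha) * base + alpha * finetuned.

    Assumption: alpha in [0, 1] mixes the original and fine-tuned weights.
    alpha = 0 returns the base weights, alpha = 1 the fine-tuned weights.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    patched = {}
    for name in base:
        if len(base[name]) != len(finetuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        patched[name] = [
            (1.0 - alpha) * b + alpha * f
            for b, f in zip(base[name], finetuned[name])
        ]
    return patched


# Toy weights standing in for an encoder before and after fine-tuning
# on an additional modality pairing.
base_weights = {"layer.weight": [1.0, 2.0]}
finetuned_weights = {"layer.weight": [3.0, 6.0]}
patched = patch_weights(base_weights, finetuned_weights, alpha=0.5)
print(patched)
```

Under this assumption, sweeping `alpha` between 0 and 1 and measuring zero-shot accuracy at each value would produce the kind of curve that Figure 3 describes.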