ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

Srikumar Sastry, Subash Khanal, Aayush Dhakal, Jiayu Lin, Dan Cher, Phoenix Jarosz, Nathan Jacobs

TL;DR

ProM3E is a probabilistic masked multimodal embedding framework for ecology that supports any-to-any generation of representations and modality inversion. It models a joint latent distribution $\mathcal{Z}_i \sim p_\mathcal{E}(\mathcal{Z}|\mathcal{G})$ and samples latents as $\mathcal{Z}_i = \mu_\mathcal{G} + \sigma_\mathcal{G} \epsilon_i$. Training proceeds in two stages: modality-specific encoders are first aligned into a unified embedding space, and a Masked Variational Autoencoder (MVAE) is then trained to reconstruct masked modalities using a contrastive reconstruction loss and a variational information bottleneck. The authors report superior cross-modal retrieval, linear probing, and habitat mapping on ecological benchmarks. The model also gives insight into modality informativeness through $||\sigma||_1$ and modality-gap dynamics, indicating that additional modalities reduce both uncertainty and alignment gaps. Because any-to-any generation reduces the need for paired data, ProM3E offers scalable ecological representation learning and rich geospatial interpretations, including species distribution and biodiversity mapping. The authors state that code, datasets, and model will be released for reproducibility.
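The latent sampling rule $\mathcal{Z}_i = \mu_\mathcal{G} + \sigma_\mathcal{G} \epsilon_i$ and the $||\sigma||_1$ uncertainty score can be illustrated with a minimal sketch. It assumes $\epsilon_i$ is standard Gaussian noise applied elementwise, which is the usual reparameterization for a variational autoencoder; the function names and the toy values of $\mu$ and $\sigma$ are hypothetical and are not taken from the paper or its code.

```python
import random

def sample_latents(mu, sigma, n_samples, seed=0):
    """Draw n_samples latents z_i = mu + sigma * eps_i, with eps_i ~ N(0, I).

    Assumption: eps_i is standard Gaussian and the product is elementwise.
    """
    rng = random.Random(seed)
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(n_samples)
    ]

def sigma_l1(sigma):
    """L1 norm of the standard-deviation vector, read as an uncertainty score."""
    return sum(abs(s) for s in sigma)

# Hypothetical toy values: a posterior conditioned on one modality
# versus one conditioned on three modalities.
mu = [0.5, -1.0, 2.0]
sigma_one_modality = [0.8, 0.6, 0.9]
sigma_three_modalities = [0.2, 0.1, 0.3]

samples = sample_latents(mu, sigma_three_modalities, n_samples=4)
```

In this toy setup the three-modality posterior has the smaller $||\sigma||_1$, mirroring the reported trend that additional modalities reduce uncertainty.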

Abstract

We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released at https://vishu26.github.io/prom3e.
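As a rough illustration of retrieval that mixes inter-modal and intra-modal similarities, the sketch below scores each gallery item with a weighted sum of two cosine similarities. The abstract does not specify the mixing rule, so the convex combination, the weight `alpha`, and all function names are assumptions. Here `inverted_query` stands in for the embedding that modality inversion would produce for the query in the gallery's modality.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def mixed_scores(query, inverted_query, gallery, alpha=0.5):
    """Score gallery items as alpha * inter-modal + (1 - alpha) * intra-modal.

    Assumption: a convex combination; the paper's actual rule may differ.
    """
    scores = []
    for item in gallery:
        inter = cosine(query, item)           # query modality vs. gallery modality
        intra = cosine(inverted_query, item)  # inverted query vs. gallery, same modality
        scores.append(alpha * inter + (1 - alpha) * intra)
    return scores

def rank(scores):
    """Gallery indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy embeddings.
query = [1.0, 0.0]
inverted_query = [0.0, 1.0]
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = mixed_scores(query, inverted_query, gallery, alpha=0.5)
ranking = rank(scores)
```

Setting `alpha=1.0` recovers purely inter-modal retrieval and `alpha=0.0` purely intra-modal retrieval, so the weight controls how much the inverted query contributes.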

Paper Structure

This paper contains 31 sections, 6 equations, 75 figures, 33 tables.

Figures (75)

  • Figure 1: ProM3E Overview. The versatile capabilities of our model with the ability to accept arbitrary input modalities.
  • Figure 2: ProM3E Framework. Using embeddings obtained from aligned modality-specific encoders, we model the probability distribution of input modalities using a masked variational autoencoder framework. Subsequently, we utilize the predicted variational distribution of input modalities to reconstruct the embeddings of masked modalities.
  • Figure 3: Mean $||\sigma||_1$ values of geographic locations at various percentile intervals of the Shannon Diversity Index derived from the iNaturalist dataset.
  • Figure 7: ICA Plot of Location Embeddings. We visually compare embeddings obtained from various tokens in the hidden representation of our model with the representation from TaxaBind. We notice that each register token captures different information.
  • Figure 8: ICA Plot of Satellite Image Embeddings. Similarly, we compare satellite image embeddings with those from TaxaBind and notice that the register tokens capture diverse information.
  • ...and 70 more figures