BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna, Sara Si-Moussi, Wilfried Thuiller, Hadrien Hendrikx, Vincent Miele

TL;DR

BotaCLIP tackles the problem of adapting Earth Observation (EO) foundation models to ecology by injecting domain-specific knowledge without full retraining. It aligns high-resolution EO image embeddings from the pre-trained DOFA backbone with vegetation relevé embeddings using a sigmoid contrastive loss $\mathcal{L}_{\text{SCL}}$, augmented by a regularization term $\mathcal{R}$ that preserves local structure to avoid catastrophic forgetting. The approach improves on raw DOFA embeddings and supervised baselines across three ecological tasks (plant presence prediction, butterfly occurrence modeling, and soil trophic-group abundance estimation), as measured by TSS, BI, and Spearman's $\rho$ respectively. Embedding-space analyses show that BotaCLIP sharpens ecological structure while retaining global geometry, suggesting that domain-aware alignment can yield transferable representations in data-scarce ecological settings with low computational overhead. The lightweight, modular pipeline supports scalable biodiversity modeling and may also apply to agriculture and forestry through frugal, ecologically informed representations.

Abstract

Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
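To make the training objective more concrete, the sketch below combines a pairwise sigmoid contrastive loss (in the style popularized by SigLIP) with a regularizer that penalizes drift from the similarity structure of the frozen DOFA embeddings. It is an illustration under stated assumptions, not the authors' implementation: this section does not give the exact form of $\mathcal{L}_{\text{SCL}}$ or $\mathcal{R}$, so the function names, the `temperature` and `bias` values, the squared-difference form of the regularizer, and the weight `lam` are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def sigmoid_contrastive_loss(img, rel, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss; img[i] and rel[i] are assumed to be a matched pair.

    temperature and bias are placeholder values (assumption), normally learned.
    """
    img, rel = l2_normalize(img), l2_normalize(rel)
    logits = temperature * img @ rel.T + bias
    labels = 2.0 * np.eye(len(img)) - 1.0  # +1 for matched pairs, -1 otherwise
    # -log(sigmoid(z)) equals log(1 + exp(-z)), computed stably with logaddexp
    return np.logaddexp(0.0, -labels * logits).sum() / len(img)

def structure_regularizer(adapted, frozen):
    """Hypothetical R: mean squared gap between the cosine-similarity matrices
    of the adapted embeddings and of the frozen backbone embeddings.

    ASSUMPTION: the exact regularizer used in the paper is not given here.
    """
    a, f = l2_normalize(adapted), l2_normalize(frozen)
    return np.mean((a @ a.T - f @ f.T) ** 2)

def total_loss(adapted_img, rel, frozen_img, lam=1.0):
    """Contrastive term plus lam times the structure-preserving term."""
    return (sigmoid_contrastive_loss(adapted_img, rel)
            + lam * structure_regularizer(adapted_img, frozen_img))
```

With this form of regularizer, any rotation of the frozen embeddings incurs no penalty, so the adapter stays free to re-orient the space toward the relevé modality as long as pairwise similarities between images are kept.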

Paper Structure

This paper contains 23 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the BotaCLIP framework. RGB orthophotos are encoded with the pre-trained ViT model DOFA and vegetation relevés with the pre-trained MLP model Botania. The two modalities are aligned with a contrastive objective regularized by the similarity structure of DOFA embeddings. After training, BotaCLIP embeddings are extracted from the image adapter using new orthophotos and serve as inputs for downstream tasks in plant, insect, and soil monitoring.
  • Figure 2: Performance of DOFA vs. BotaCLIP on plant (TSS), butterfly (BI), and soil (Spearman’s $\rho$) tasks. Scatter plots (left, middle) show per-species scores with the identity line as reference. The bar plot (right) shows mean correlations by trophic groups aggregated by biological categories. $\%\uparrow$ denotes average relative gain of BotaCLIP over DOFA.
  • Figure 3: UMAP 2D visualization of DOFA (left) and BotaCLIP (right) embeddings, colored by six broad landscape categories.
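For readers unfamiliar with two of the scores named in the Figure 2 caption, the following minimal sketch computes the True Skill Statistic (sensitivity plus specificity minus one) for binary presence predictions and Spearman's $\rho$ (the Pearson correlation of ranks) for abundance predictions. The function names are illustrative, the paper's own evaluation code may differ, and BI is left out because this section does not define it.

```python
import numpy as np

def true_skill_statistic(y_true, y_pred):
    """TSS = sensitivity + specificity - 1, for binary presence/absence labels.

    Assumes both classes occur in y_true; otherwise a ratio is undefined.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation between rank vectors."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]
```

A TSS of 1 means perfect discrimination and 0 means no better than chance, while Spearman's $\rho$ rewards any monotone agreement between predicted and observed abundances, not only a linear one.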