Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Andy V. Huynh; Lauren E. Gillespie; Jael Lopez-Saucedo; Claire Tang; Rohan Sikand; Moisés Expósito-Alonso

TL;DR

Leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent.

Abstract

Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP), a new pre-training task for ground-level and aerial image representation learning of the natural world, and introduce Nature Multi-View (NMV), a dataset of natural world imagery including more than 3 million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at hf.co/datasets/andyvhuynh/NatureMultiView.

Paper Structure (59 sections, 6 equations, 3 figures, 14 tables)

Introduction
Related Work
Representation learning for fine-grained species recognition
Representation learning for remote sensing imagery
The Nature Multi-View Dataset
Nature Multi-View Dataset curation
Dataset limitations
The CRISP Framework
Standard CRISP objective
CRISP objective with remote sensing data augmentation
Many-to-one CRISP objective
Parameterized CRISP objective
Experiments
Fine-grained species recognition
Species distribution mapping
...and 44 more sections

Figures (3)

Figure 1: Overcoming label scarcity for natural world imagery. a. Label inequity is a major problem for many publicly available natural world imagery datasets, as many countries below the equator have relatively few label-quality observations per unit area and biodiversity (e.g. South America, Africa). b. Images taken from the ground and from above at the same location often look visually similar and encode rich mutual information about natural world scenes and objects. c. The CRISP framework leverages this similarity to learn a joint representation between ground-level and remote sensing image pairs, so that paired images from the same location (diagonal) have more similar representations than paired images from other locations (off-diagonal).
Figure 2: Overview of the Nature Multi-View dataset (NMV). a. The 1,755,602 observations in the NMV dataset. b. Labels for identified observations in the NMV dataset exhibit a long tail across classes, as is common in natural world settings. c. NMV observation density is also not uniform across space, a consequence of the opportunistic nature of citizen science datasets like iNaturalist. d. A diverse set of ground-level (top row) and paired aerial (bottom row) examples from the NMV dataset.
Figure 3: CRISP pre-trained representations recapitulate ecological expectations. a. UMAP projection of CRISP aerial view encoder embeddings in color space, overlaid on a map of aerial image locations. Aerial embeddings appear to be more similar for aerial images taken from similar ecosystems, even if far apart (see inset). b. UMAP projection of CRISP ground view encoder embeddings in color space, overlaid on a taxonomic tree of the genera in the dataset. Ground-level embeddings appear to be more similar for more closely related genera, along with visually similar genera that may not be related (see inset).
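
The diagonal versus off-diagonal structure described in Figure 1c can be illustrated with a generic CLIP-style symmetric contrastive loss. This is a minimal sketch under stated assumptions, not necessarily the paper's exact CRISP objective: the function names, the pure-Python list representation of embeddings, and the temperature value of 0.07 are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(ground, aerial, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    ground[i] and aerial[i] are assumed to come from the same location.
    The temperature default is an illustrative assumption.
    """
    g = [l2_normalize(v) for v in ground]
    a = [l2_normalize(v) for v in aerial]
    n = len(g)
    # sim[i][j]: scaled cosine similarity of ground image i and aerial image j.
    sim = [[sum(x * y for x, y in zip(gi, aj)) / temperature for aj in a]
           for gi in g]
    # Ground -> aerial: each row should concentrate on its diagonal entry.
    loss_g2a = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Aerial -> ground: the same criterion applied to columns.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    loss_a2g = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_g2a + loss_a2g)


# Toy batch: two locations with two-dimensional embeddings.
ground = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(ground, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss(ground, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy batch, the loss is close to zero when each ground embedding matches its own aerial embedding and becomes large when the pairs are swapped, which is the behaviour the diagonal of the similarity matrix is meant to capture.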
