Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Andy V. Huynh, Lauren E. Gillespie, Jael Lopez-Saucedo, Claire Tang, Rohan Sikand, Moisés Expósito-Alonso

TL;DR

Leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent.

Abstract

Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP), a new pre-training task for ground-level and aerial image representation learning of the natural world, and introduce Nature Multi-View (NMV), a dataset of natural world imagery including more than 3 million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at hf.co/datasets/andyvhuynh/NatureMultiView.

Paper Structure (59 sections, 6 equations, 3 figures, 14 tables)

Figures (3)

  • Figure 1: Overcoming label scarcity for natural world imagery. a. Label inequity is a major problem for many publicly available natural world imagery datasets, as many countries below the equator have relatively few label-quality observations per unit area and biodiversity (e.g. South America, Africa). b. Images taken from the ground and from above at the same location often look visually similar and encode rich mutual information about natural world scenes and objects. c. The CRISP framework leverages this similarity to learn a joint representation between ground-level and remote sensing image pairs, so that paired images from the same location (diagonal) have more similar representations than paired images from other locations (off-diagonal).
  • Figure 2: Overview of the Nature Multi-View dataset (NMV). a. The 1,755,602 observations in the NMV dataset. b. Labels for identified observations in the NMV dataset exhibit a long tail across classes, as is common in natural world settings. c. NMV observation density is also not uniform across space, a consequence of the opportunistic nature of citizen science datasets like iNaturalist. d. A diverse set of ground-level (top row) and paired aerial (bottom row) examples from the NMV dataset.
  • Figure 3: CRISP pre-trained representations recapitulate ecological expectations. a. UMAP projection of CRISP aerial view encoder embeddings in color space, overlaid on a map of aerial image locations. Aerial embeddings appear to be more similar for aerial images taken from similar ecosystems, even if far apart (see inset). b. UMAP projection of CRISP ground view encoder embeddings in color space, overlaid on the taxonomic tree of genera in the dataset. Ground-level embeddings appear to be more similar for more closely related genera, along with visually similar genera that may not be related (see inset).
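
The pairing objective described in the Figure 1c caption (matched ground-level and aerial images on the diagonal, mismatched ones off the diagonal) can be illustrated with a small sketch. The captions above do not state the exact loss, so this assumes a CLIP-style symmetric InfoNCE objective with a temperature of 0.07; the function names and toy embeddings are hypothetical and are not taken from the authors' code.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(ground, aerial, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Assumption: ground[i] and aerial[i] come from the same location.
    The temperature value is illustrative, not taken from the paper.
    """
    g = [l2_normalize(v) for v in ground]
    a = [l2_normalize(v) for v in aerial]
    n = len(g)
    # sim[i][j]: temperature-scaled cosine similarity between
    # ground image i and aerial image j.
    sim = [[sum(x * y for x, y in zip(gi, aj)) / temperature for aj in a]
           for gi in g]
    # Ground -> aerial: each row should place its probability on the diagonal.
    loss_g2a = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Aerial -> ground: the same requirement applied to the columns.
    sim_t = [list(col) for col in zip(*sim)]
    loss_a2g = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_g2a + loss_a2g)

# Toy batch of two locations with 2-D embeddings.
ground_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(ground_emb, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(ground_emb, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

In this toy batch, the correctly paired embeddings give a much lower loss than the shuffled pairing, which is the behavior the diagonal versus off-diagonal description implies. Because each encoder produces its own embedding, either one can in principle be used alone downstream, which is consistent with the abstract's point about classification when one view is absent.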