CATALOG: A Camera Trap Language-guided Contrastive Learning Model

Julian D. Santamaria, Claudia Isaza, Jhony H. Giraldo

TL;DR

The paper tackles domain shift in camera-trap species recognition by fusing text and image representations from multiple foundation models (FMs). It introduces CATALOG, which builds centroid-based textual embeddings, aligns text, image, and image-text features via a convex fusion S = αW + (1−α)Q, and trains with a contrastive loss while keeping the FMs frozen and training only the MLP. Evaluations on Snapshot Serengeti and Terra Incognita show state-of-the-art performance under cross-domain conditions and highlight the importance of CLIP-based visual features, LLM-derived descriptions, and template-based text. The results demonstrate the practical potential of multimodal foundation-model fusion for robust wildlife monitoring with open-vocabulary capabilities.
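
To make the convex fusion S = αW + (1−α)Q concrete, here is a minimal sketch rather than the authors' implementation. It assumes that W and Q are two similarity matrices of identical shape (one row per image, one column per candidate class). The function names, the toy values, and the choice alpha = 0.6 are hypothetical.

```python
def fuse_similarities(w, q, alpha):
    """Element-wise convex combination S = alpha * W + (1 - alpha) * Q.

    w, q: lists of rows with identical shape (assumed: images x classes).
    alpha: mixing weight in [0, 1]; alpha = 1 uses only W, alpha = 0 only Q.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if len(w) != len(q) or any(len(rw) != len(rq) for rw, rq in zip(w, q)):
        raise ValueError("W and Q must have the same shape")
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(rw, rq)]
        for rw, rq in zip(w, q)
    ]


def predict(s):
    """Return, for each row of S, the index of the highest fused score."""
    return [max(range(len(row)), key=row.__getitem__) for row in s]


# Toy similarity scores for 2 images and 2 classes (hypothetical values).
W = [[0.9, 0.1], [0.2, 0.8]]
Q = [[0.1, 0.9], [0.4, 0.6]]

S = fuse_similarities(W, Q, alpha=0.6)
predicted_classes = predict(S)
```

With alpha close to 1 the prediction follows W, and with alpha close to 0 it follows Q, which is why a sensitivity study over alpha (Figure 4) is informative.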

Abstract

Foundation Models (FMs) have been successful in various computer vision tasks like image classification, object detection and image segmentation. However, these tasks remain challenging when these models are tested on datasets with different distributions from the training dataset, a problem known as domain shift. This is especially problematic for recognizing animal species in camera-trap images where we have variability in factors like lighting, camouflage and occlusions. In this paper, we propose the Camera Trap Language-guided Contrastive Learning (CATALOG) model to address these issues. Our approach combines multiple FMs to extract visual and textual features from camera-trap data and uses a contrastive loss function to train the model. We evaluate CATALOG on two benchmark datasets and show that it outperforms previous state-of-the-art methods in camera-trap image recognition, especially when the training and testing data have different animal species or come from different geographical areas. Our approach demonstrates the potential of using FMs in combination with multi-modal fusion and contrastive learning for addressing domain shifts in camera-trap image recognition. The code of CATALOG is publicly available at https://github.com/Julian075/CATALOG.
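
The abstract does not spell out the contrastive loss, so the following is only a sketch of one common form: a softmax cross-entropy over per-class similarity scores. The temperature value, the function name, and the toy inputs are assumptions, and the loss used in the paper may differ in its details.

```python
import math


def contrastive_loss(similarities, labels, temperature=0.1):
    """Mean softmax cross-entropy over similarity scores.

    similarities: list of rows, one score per candidate class (assumed layout).
    labels: index of the correct class for each row.
    temperature: hypothetical scaling factor; smaller values sharpen the softmax.
    """
    total = 0.0
    for row, label in zip(similarities, labels):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_norm - logits[label]
    return total / len(labels)


# Toy batch: the first image matches class 0, the second matches class 1.
scores = [[0.58, 0.42], [0.28, 0.72]]
loss_correct = contrastive_loss(scores, [0, 1])
loss_swapped = contrastive_loss(scores, [1, 0])
```

Minimizing this quantity pulls each image toward the text representation of its own class and pushes it away from the others, so loss_correct is smaller than loss_swapped in the toy batch.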

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison of CATALOG, BioCLIP (Stevens et al., 2024), and WildCLIP (Gabeff et al., 2024) under challenging camera-trap conditions. CATALOG demonstrates superior performance.
  • Figure 2: The pipeline of CATALOG. Our model is divided into five parts: i) text embeddings, ii) image embeddings, iii) image-text embeddings, iv) feature alignment, and v) loss function. The textual embeddings are computed using a set of pre-defined templates and the LLM descriptions. The image embeddings are computed using CLIP. The image-text embeddings are calculated using LLaVA and BERT. Finally, we align the multi-modal features using an alignment mechanism and train the model with a contrastive loss function.
  • Figure 3: Cropped images from the Snapshot Serengeti and Terra Incognita datasets where we observe the domain shift and the difference in classes (different animal species).
  • Figure 4: Sensitivity analysis of the hyperparameter $\alpha$ of CATALOG on the Terra Incognita dataset for out-of-domain evaluation.
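
The Figure 2 caption says the text embeddings come from pre-defined templates and LLM descriptions, and the TL;DR calls them centroid-based. One plausible reading, sketched below under that assumption, is to average all text vectors of a class and L2-normalize the result. The placeholder vectors stand in for real text-encoder outputs, and all names are hypothetical.

```python
import math


def centroid(vectors):
    """Average a list of equal-length vectors and L2-normalize the mean."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero mean cannot be normalized; return it unchanged in that case.
    return [x / norm for x in mean] if norm > 0 else mean


def class_text_embedding(template_vectors, description_vectors):
    """Combine template and LLM-description vectors into one class centroid."""
    return centroid(template_vectors + description_vectors)


# Placeholder 2-D vectors for one class (hypothetical, not encoder outputs).
template_vectors = [[1.0, 0.0], [0.8, 0.2]]
description_vectors = [[0.0, 1.0]]
zebra_embedding = class_text_embedding(template_vectors, description_vectors)
```

A single normalized vector per class keeps the later similarity computation at one comparison per class, regardless of how many templates or descriptions were used.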