CATALOG: A Camera Trap Language-guided Contrastive Learning Model
Julian D. Santamaria, Claudia Isaza, Jhony H. Giraldo
TL;DR
The paper tackles domain shift in camera-trap species recognition by fusing text and image representations from multiple foundation models (FMs). It introduces CATALOG, which builds centroid-based textual embeddings, aligns text, image, and image-text features via a convex fusion S = αW + (1−α)Q, and trains with a contrastive loss, keeping the FMs frozen so that only an MLP is trainable. Evaluations on Snapshot Serengeti and Terra Incognita show state-of-the-art performance under cross-domain conditions and highlight the importance of CLIP-based visual features, LLM-derived descriptions, and template-based text. The results demonstrate the practical potential of multimodal foundation-model fusion for robust wildlife monitoring with open-vocabulary capabilities.
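To make the fusion S = αW + (1−α)Q concrete, here is a minimal NumPy sketch of the three ingredients named above: class centroids over text embeddings, a convex combination of two similarity matrices, and a softmax cross-entropy loss over the fused scores. It is an illustration under stated assumptions, not the authors' implementation. The summary does not define W and Q, so the sketch assumes both are (images × classes) cosine-similarity matrices computed against the text centroids. The function names, the temperature value, α = 0.6, and the random features standing in for frozen FM outputs are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def class_centroids(text_emb):
    """Average several prompt embeddings per class into one unit-norm centroid.

    text_emb has shape (classes, prompts, dim); the result is (classes, dim).
    """
    return l2_normalize(l2_normalize(text_emb).mean(axis=1))


def fused_similarity(W, Q, alpha):
    """Convex fusion S = alpha * W + (1 - alpha) * Q, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return alpha * W + (1.0 - alpha) * Q


def contrastive_loss(S, labels, temperature=0.1):
    """Softmax cross-entropy over fused similarities (assumed loss form)."""
    logits = S / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Random stand-ins for frozen foundation-model features (assumption).
rng = np.random.default_rng(0)
n_img, n_cls, n_prompts, dim = 4, 3, 5, 8
centroids = class_centroids(rng.normal(size=(n_cls, n_prompts, dim)))
img_feat = l2_normalize(rng.normal(size=(n_img, dim)))   # e.g. image branch
desc_feat = l2_normalize(rng.normal(size=(n_img, dim)))  # e.g. image-text branch

W = img_feat @ centroids.T   # assumed: similarities from one branch
Q = desc_feat @ centroids.T  # assumed: similarities from the other branch
S = fused_similarity(W, Q, alpha=0.6)

labels = np.array([0, 1, 2, 0])
loss = contrastive_loss(S, labels)
predictions = S.argmax(axis=1)  # predicted class index per image
```

Because the weights α and 1−α are non-negative and sum to one, every entry of S lies between the corresponding entries of W and Q. Setting α = 1 recovers W alone and α = 0 recovers Q alone, so α controls how much each branch contributes.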
Abstract
Foundation Models (FMs) have been successful in various computer vision tasks such as image classification, object detection, and image segmentation. However, these tasks remain challenging when the models are tested on datasets whose distributions differ from that of the training dataset, a problem known as domain shift. This is especially problematic for recognizing animal species in camera-trap images, which vary in factors such as lighting, camouflage, and occlusion. In this paper, we propose the Camera Trap Language-guided Contrastive Learning (CATALOG) model to address these issues. Our approach combines multiple FMs to extract visual and textual features from camera-trap data and uses a contrastive loss function to train the model. We evaluate CATALOG on two benchmark datasets and show that it outperforms previous state-of-the-art methods in camera-trap image recognition, especially when the training and testing data have different animal species or come from different geographical areas. Our approach demonstrates the potential of using FMs in combination with multi-modal fusion and contrastive learning for addressing domain shifts in camera-trap image recognition. The code of CATALOG is publicly available at https://github.com/Julian075/CATALOG.
