Towards Zero-Shot Camera Trap Image Categorization

Jiří Vyskočil, Lukas Picek

TL;DR

Across three camera trap datasets, combining MegaDetector with two separate classifiers achieves the highest accuracy, while a zero-shot pipeline based on DINOv2 and FAISS achieves competitive results, which highlights the potential of zero-shot approaches for camera trap image categorization.

Abstract

This paper describes the search for an alternative approach to the automatic categorization of camera trap images. First, we benchmark state-of-the-art classifiers using a single model for all images. Next, we evaluate methods that combine MegaDetector with one or more classifiers and Segment Anything to assess their impact on reducing location-specific overfitting. Last, we propose and test two approaches using large language and foundation models, such as DINOv2, BioCLIP, BLIP, and ChatGPT, in a zero-shot scenario. An evaluation on two publicly available datasets (WCT from New Zealand, CCT20 from the Southwestern US) and a private dataset (CEF from Central Europe) revealed that combining MegaDetector with two separate classifiers achieves the highest accuracy. This approach reduced the relative error of a single BEiTV2 classifier by approximately 42% on CCT20, 48% on CEF, and 75% on WCT. In addition, when the background is removed, the accuracy error in new locations is halved. The proposed zero-shot pipeline based on DINOv2 and FAISS achieved competitive results (1.0% and 4.7% lower on CCT20 and CEF, respectively), which highlights the potential of zero-shot approaches for camera trap image categorization.
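The relative error reductions quoted in the abstract can be read as the fraction of the baseline classifier's error that the improved pipeline removes. The sketch below shows that arithmetic under this assumed reading; the function name `relative_error_reduction` and the accuracy values are hypothetical and are not results from the paper.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Return the fraction of the baseline error removed.

    Accuracies are given in [0, 1]; error is taken as 1 - accuracy.
    """
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline has no error to reduce")
    return (baseline_err - new_err) / baseline_err


# Hypothetical accuracies for illustration only (NOT taken from the paper):
# a baseline at 80% accuracy and an improved pipeline at 90% accuracy.
example = relative_error_reduction(0.80, 0.90)
print(f"relative error reduction: {example:.0%}")
```

With these made-up numbers the error drops from 20% to 10%, so half of the baseline error is removed even though accuracy rises by only 10 percentage points.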

Paper Structure

This paper contains 11 sections, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Illustration of the four baseline approaches. (a) Standard image classification approach where images are resized and fed to a trained image classifier. (b) A more recent two-stage approach with object detection before the classification. This approach can suffer from a missing reject option for images with no detection. (c) With two image classifiers, the problem of (b) can be easily mitigated. (d) Excluding background pixels using SAM could help to prevent overfitting to the location.
  • Figure 2: Zero-shot segmentation with Segment Anything. Random samples from the CCT20 dataset are processed by MegaDetector, and the resulting detections are fed into SAM. Even with the poor quality of the data, the zero-shot segmentation performs relatively well across a wide range of species. However, on infrared images and small objects, SAM starts to fail (3$^{rd}$ and 4$^{th}$ columns from the left).
  • Figure 3: Zero-shot approaches. (e) Multi-modal methods extract image information based on given textual and image prompts. (f) Embeddings are generated from the training set, and during inference, selective search finds similar images from the database.
  • Figure 4: Top-3 closest images to given inputs using DINOv2$_G$ embeddings.
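
The retrieval idea in Figure 3 (f) and Figure 4 can be sketched as a nearest-neighbor lookup over stored embeddings. The example below is a minimal illustration, not the paper's implementation: it uses brute-force cosine similarity in plain Python as a stand-in for a FAISS index, tiny hand-written vectors in place of DINOv2 embeddings, and a majority vote over the top-k neighbors, which is an assumed decision rule. The names `cosine` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(db_embeddings, db_labels, query, k=3):
    """Label a query by majority vote over its k most similar database entries.

    Brute-force search; a real pipeline would query a vector index instead.
    """
    ranked = sorted(
        range(len(db_embeddings)),
        key=lambda i: cosine(query, db_embeddings[i]),
        reverse=True,
    )
    votes = Counter(db_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings" (assumption: real embeddings would come from an
# image encoder applied to labeled training images).
db_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
db_labels = ["deer", "deer", "boar", "boar"]

predictions = [
    knn_predict(db_embeddings, db_labels, q, k=3)
    for q in ([1.0, 0.05], [0.05, 1.0])
]
print(predictions)
```

Because no classifier is trained on the target data, adding a new species in this setup only requires adding labeled embeddings to the database.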