Efficient Curation of Invertebrate Image Datasets Using Feature Embeddings and Automatic Size Comparison

Mikko Impiö, Philipp M. Rehsen, Jenni Raitoharju

TL;DR

The paper tackles the challenge of curating large invertebrate image datasets for environmental monitoring by introducing two complementary, training-light approaches: a content-based method using feature embeddings and a simple size-based comparison. It formalizes a prototype-based ranking using the group prototype $\mathbf{m}^c = \frac{1}{N^c}\sum_i \mathbf{z}_i^c$ and the distance $d_i^c = d_{\cos}(\mathbf{z}_i^c, \mathbf{m}^c)$, alongside an area-based score $a_{\Delta,i}^c = \frac{|a_i^c-\bar{a}^c|}{\bar{a}^c}$, enabling outlier detection without dataset-specific training. A new benchmark dataset with 90,380 images across 24 categories and four outlier classes is released, together with three new human-centric evaluation metrics that quantify curation effort. Results on BIODISCOVER-derived data show that the two methods are broadly competitive and complementary: embeddings excel at finding bubbles and forceps, while the size-based comparison captures detached body parts and misclassifications, which makes the combination practical for automated biomonitoring systems.
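The two scores above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names and the toy data are hypothetical, $d_{\cos}$ is assumed to be one minus the cosine similarity, and embeddings are assumed to be non-zero vectors.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors: the group prototype m^c."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(u, v):
    """Assumed definition: d_cos(u, v) = 1 - (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def prototype_distances(embeddings):
    """d_i^c: distance of every embedding in one group to the group prototype."""
    prototype = mean_vector(embeddings)
    return [cosine_distance(z, prototype) for z in embeddings]

def relative_area_deviations(areas):
    """a_delta_i^c = |a_i - mean(a)| / mean(a) for one group of object areas."""
    mean_area = sum(areas) / len(areas)
    return [abs(a - mean_area) / mean_area for a in areas]

def rank_outliers(scores):
    """Indices ordered from most to least suspicious (largest score first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy group: three similar images and one that differs in content and size.
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
areas = [100.0, 110.0, 90.0, 20.0]

embedding_ranking = rank_outliers(prototype_distances(embeddings))
area_ranking = rank_outliers(relative_area_deviations(areas))
```

In this toy group both rankings place the fourth image first, so a human reviewer would inspect it before the others. On real data the two rankings can disagree, which is why the summary describes the methods as complementary.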

Abstract

The number of image datasets collected for environmental monitoring purposes has increased in recent years as computer-vision-assisted methods have gained interest. Computer vision applications rely on high-quality datasets, making data curation important. However, data curation is often done ad hoc, and the methods used are rarely published. We present a method for curating large-scale image datasets of invertebrates that contain multiple images of the same taxa and/or specimens and have a relatively uniform background in the images. Our approach is based on extracting feature embeddings with pretrained deep neural networks and using these embeddings to find the most visually distinct images by comparing their embeddings to the group prototype embedding. We also show that a simple area-based size comparison approach is able to find many common erroneous images, such as images containing detached body parts and misclassified samples. In addition to the method, we propose novel metrics for evaluating human-in-the-loop outlier detection methods. The implementations of the proposed curation methods, as well as a benchmark dataset containing annotated erroneous images, are publicly available at https://github.com/mikkoim/taxonomist-studio.

Paper Structure

This paper contains 10 sections, 7 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Examples of successful BIODISCOVER images
  • Figure 2: Examples of erroneous content in images captured by a BIODISCOVER imaging device: A: Bubbles, B: Detached body parts, C: Forceps, D: Misclassifications
  • Figure 3: Different possible groupings for our datasets. Each individual Specimen (Sp) belongs to a Taxon (T). When a specimen is imaged, a single imaging run is called a Sample (Sa), which consists of two image sequences from two camera angles. We refer to a sequence from one of the cameras as the Cam (C) sequence.
  • Figure 4: Comparison of the embedding method performance measured with several metrics and for different outlier types. The performance depends on the grouping chosen, shown on the x-axis in order of increasing group size. C=Cam, Sa=Sample, Sp=Specimen, T=Taxon. Figure 3 explains the groupings. Overall best results are achieved with the taxon-level grouping.
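
The grouping levels in Figures 3 and 4 decide which images share a prototype. The sketch below shows one way to form the groups, assuming each image record carries hypothetical `taxon`, `specimen`, `sample`, and `cam` fields; the field names, the `GROUPING_KEYS` mapping, and the toy records are illustrative and not taken from the paper or its repository.

```python
from collections import defaultdict

# Hypothetical key tuples per grouping level, from smallest to largest group.
GROUPING_KEYS = {
    "C": ("taxon", "specimen", "sample", "cam"),  # one camera sequence
    "Sa": ("taxon", "specimen", "sample"),        # one imaging run (both cameras)
    "Sp": ("taxon", "specimen"),                  # all runs of one specimen
    "T": ("taxon",),                              # all specimens of one taxon
}

def group_images(records, level):
    """Map each group key at the chosen level to the images that fall in it."""
    keys = GROUPING_KEYS[level]
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["image"])
    return dict(groups)

# Toy records: two taxa, three specimens, four samples, five camera sequences.
records = [
    {"image": "a.png", "taxon": "T1", "specimen": "Sp1", "sample": "Sa1", "cam": 1},
    {"image": "b.png", "taxon": "T1", "specimen": "Sp1", "sample": "Sa1", "cam": 2},
    {"image": "c.png", "taxon": "T1", "specimen": "Sp1", "sample": "Sa2", "cam": 1},
    {"image": "d.png", "taxon": "T1", "specimen": "Sp2", "sample": "Sa3", "cam": 1},
    {"image": "e.png", "taxon": "T2", "specimen": "Sp3", "sample": "Sa4", "cam": 1},
]

# Number of groups at each level; coarser levels yield fewer, larger groups.
group_counts = {level: len(group_images(records, level)) for level in GROUPING_KEYS}
```

Coarser levels pool more images into each prototype, which is consistent with the caption's observation that the taxon-level grouping gives the best overall results.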