A benchmark for computational analysis of animal behavior, using animal-borne tags

Benjamin Hoffman, Maddie Cusimano, Vittorio Baglione, Daniela Canestrari, Damien Chevallier, Dominic L. DeSantis, Lorène Jeantet, Monique A. Ladds, Takuya Maekawa, Vicente Mata-Silva, Víctor Moreno-González, Anthony Pagano, Eva Trapote, Outi Vainio, Antti Vehkaoja, Ken Yoda, Katherine Zacarian, Ari Friedlaender

TL;DR

The paper introduces the Bio-logger Ethogram Benchmark (BEBE), the largest publicly available multi-species benchmark for analyzing animal behavior from bio-logger time-series data. It standardizes a behavior classification task across nine annotated datasets, enabling robust comparison of deep neural networks and classical ML methods, and assesses the impact of self-supervised pre-training using human accelerometer data. Across datasets, deep models, particularly CRNN and harnet, outperform traditional feature-based approaches, with self-supervised pre-training providing notable advantages in low-data settings. BEBE provides open datasets and code, furnishes concrete design guidance for ML-based ethology studies, and invites community collaboration to broaden taxonomic coverage and modeling tasks for conservation and behavioral ecology. The work advances scalable, cross-species ML evaluation for bio-logging and highlights the promise and limits of self-supervised transfer in animal behavior inference.

Abstract

Animal-borne sensors ('bio-loggers') can record a suite of kinematic and environmental data, which are used to elucidate animal ecophysiology and improve conservation efforts. Machine learning techniques are used for interpreting the large amounts of data recorded by bio-loggers, but there exists no common framework for comparing the different machine learning techniques in this domain. This makes it difficult to, for example, identify patterns in what works well for machine learning-based analysis of bio-logger data. It also makes it difficult to evaluate the effectiveness of novel methods developed by the machine learning community. To address this, we present the Bio-logger Ethogram Benchmark (BEBE), a collection of datasets with behavioral annotations, as well as a modeling task and evaluation metrics. BEBE is to date the largest, most taxonomically diverse, publicly available benchmark of this type. Using BEBE, we compare the performance of deep and classical machine learning methods for identifying animal behaviors based on bio-logger data. As an example usage of BEBE, we test an approach based on self-supervised learning. To apply this approach to animal behavior classification, we adapt a deep neural network pre-trained with 700,000 hours of data collected from human wrist-worn accelerometers. We find that deep neural networks outperform the classical machine learning methods we tested across all nine datasets in BEBE. We additionally find that the approach based on self-supervised learning outperforms the alternatives we tested, especially in settings where little training data is available. In light of this, we are able to make concrete suggestions for designing studies that rely on machine learning to infer behavior from bio-logger data. Datasets and code are available at https://github.com/earthspecies/BEBE.

Paper Structure

This paper contains 39 sections, 2 equations, 34 figures, 3 tables.

Figures (34)

  • Figure 1: A) Examples of ethograms in BEBE. Left: gull ethogram with three behaviors. Right: a subset of the dog ethogram, with four behaviors. B) BEBE consists of a supervised behavior classification task on nine annotated datasets, along with a set of metrics that compare model predictions with the annotations. Datasets and code are publicly available at https://github.com/earthspecies/BEBE. C) Datasets in BEBE, with a photo of a representative individual and a 5-minute clip of annotated tri-axial accelerometer (TIA) data for each. Each accelerometer channel is min-max scaled for visualization. Top row: black-tailed gull (Larus crassirostris) [korpelaMachineLearningEnables2020], domestic dog (Canis familiaris) [kumpulainenDogBehaviourClassification2021, vehkaojaDescriptionMovementSensor2022], carrion crow (Corvus corone) [stidsholt2019tag] (see Methods). Middle row: western diamondback rattlesnake (Crotalus atrox) [desantisIntegrativeFrameworkLongTerm2020], humpback whale (Megaptera novaeangliae) [friedlaenderExtremeDielVariation2013], New Zealand fur seal (Arctocephalus forsteri) [laddsSeeingItAll2016]. Bottom row: polar bear (Ursus maritimus) [paganoMetabolicRateBody2018, paganoUsingTriaxialAccelerometers2017], sea turtle (Chelonia mydas) [jeantetBehaviouralInferenceSignal2020], human (Homo sapiens) [anguitaPublicDomainDataset2013]. Gaps indicate that the behavior annotation is Unknown. For image attributions, see acknowledgments.
  • Figure 2: Example data from BEBE. Each row displays ten 1-minute clips from one dataset, showing behavior labels, three tri-axial accelerometer channels ($g$), as well as speed ($m/s$), saltwater conductivity (wet/dry), and/or depth ($m$) if available. Examples were chosen to focus on transitions between behaviors. Acceleration traces for behavior classes range from highly stereotyped (e.g., Sit in HAR) to highly variable (e.g., Feed in Seals). For examples of each behavior in each dataset, with the full set of dataset channels, see Supplemental Figures \ref{HAR_examples_supplement}-\ref{dog_examples1_supplement}.
  • Figure 3: A) Summary of training and evaluation. Our process of data analysis follows the standard three steps of creating and evaluating machine learning models. In the first step (Training), the model learns from the train set of one dataset, including behavioral annotations. In the second step (Inference), the model makes predictions about the behavioral annotation for the test set data, which comprises data from a set of individuals distinct from those in the train set. In the third step (Evaluation), the model's predictions are evaluated based on their agreement with known behavioral annotations. B) Example data from the Whale dataset [friedlaenderExtremeDielVariation2013], and predictions made by a CRNN model. The trained model is fed raw time series data, which it uses to make behavior predictions. These predictions are compared with annotations to arrive at performance scores. In this case, the model predicts the annotations well. Gaps in the behavior annotations indicate the behavior is Unknown at those samples; those samples are ignored in the evaluation metrics. C) During hyperparameter optimization, we train a set of models with various hyperparameters and low/high frequency cutoffs. We select the hyperparameters and low/high frequency cutoff of the model that maximizes the F1 score on the first test fold. D) During cross-validation, we compute the test scores for the other four folds. The final score is averaged across all individuals in the test folds. The first test fold, used for hyperparameter optimization, is not used for testing.
  • Figure 4: F1 scores on the test set for supervised task. Here and elsewhere, the table is color-coded such that within a dataset (column), the brightest color indicates the best performing model for that metric, and the darkest color indicates the worst performing model. Numbers indicate the average score across individuals in the test folds, with the standard deviation in parentheses. The F1 score is macro-averaged across classes. Out of nine datasets, harnet does best on five datasets for F1, as indicated by the bright yellow entries in its row. CRNN does best on the other four datasets. For precision and recall results, see Figure \ref{supplement_precision_recall_basic_results}.
  • Figure 5: Self-supervised pre-training and reduced data setting. A) Pre-training task (performed in [yuan2022selfsupervised]): The main component of our harnet model has a ResNet architecture [heDeepResidualLearning2015]. The ResNet was pre-trained with un-annotated human wrist-worn accelerometer data, which was modified with one of a set of signal transformations (e.g. $f_0$ = reversal in time). The network was trained to classify which transformation was applied to the original data. B) In our harnet model, the input to the pre-trained ResNet was animal bio-logger data, without any modification to sampling rate. The outputs of the ResNet were passed to a recurrent neural network (RNN), which produced the behavior predictions. This full harnet model was then trained as shown in Figure \ref{evaluation_summary}. C) In the full data setting, four out of five folds are used to train the model in one instance of cross-validation. In the reduced data setting, only one fold is used for training while the test set is the same. In other words, approximately four times more individuals are included in the train set in the full data setting than in the reduced data setting. D) F1 scores for full data task. harnet frozen does best on five datasets and CRNN does best on three datasets. We omitted the RNN wavelet model from the full data experiments, due to the high computational resources required for training and its poor performance in the reduced data setting. E) F1 scores for the reduced data task. harnet frozen does the best on all nine datasets. F) Difference in F1 between reduced and full data tasks. For five datasets, harnet frozen shows the smallest decrease in F1 when using reduced data. For precision and recall results, see Figure \ref{supplement_precision_recall_representation_results}.
  • ...and 29 more figures
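
The evaluation protocol described in the captions of Figures 3 and 4 (folds made up of distinct individuals, the first fold reserved for hyperparameter selection, macro-averaged F1 that ignores Unknown samples, and a final score averaged across test individuals) can be illustrated with a minimal sketch. This is not the BEBE code: every function name, the `UNKNOWN = -1` sentinel, the round-robin fold assignment, the choice to score a class with no support as 0, and the majority-class toy model are assumptions made only for illustration.

```python
from collections import defaultdict

UNKNOWN = -1  # assumed sentinel for samples whose annotation is Unknown


def macro_f1(y_true, y_pred, num_classes, unknown=UNKNOWN):
    """Macro-averaged F1 over classes; samples annotated Unknown are skipped.

    Assumes predictions are integers in [0, num_classes). A class with no
    true or predicted samples contributes 0 (an assumed convention).
    """
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == unknown:
            continue  # Unknown samples do not affect the metric
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1
            fp[p] += 1
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


def assign_folds(individual_ids, num_folds=5):
    """Place each individual in exactly one fold (round-robin, an assumption)."""
    unique_ids = sorted(set(individual_ids))
    return {ind: i % num_folds for i, ind in enumerate(unique_ids)}


def cross_validated_f1(data, fold_of, train_fn, num_classes,
                       num_folds=5, tuning_fold=0):
    """Average per-individual macro F1 over all test folds except tuning_fold.

    data: dict mapping individual id -> (features, labels)
    train_fn: callable taking a dict like `data` and returning predict(features)
    """
    per_individual = []
    for test_fold in range(num_folds):
        if test_fold == tuning_fold:
            continue  # fold used for hyperparameter selection is never a test fold
        train = {i: d for i, d in data.items() if fold_of[i] != test_fold}
        predict = train_fn(train)
        for ind, (features, labels) in data.items():
            if fold_of[ind] == test_fold:
                per_individual.append(
                    macro_f1(labels, predict(features), num_classes))
    return sum(per_individual) / len(per_individual)


def train_majority(train):
    """Toy stand-in for a model: always predicts the most common known label."""
    counts = defaultdict(int)
    for _, labels in train.values():
        for label in labels:
            if label != UNKNOWN:
                counts[label] += 1
    majority = max(counts, key=counts.get)
    return lambda features: [majority] * len(features)


if __name__ == "__main__":
    # Five hypothetical individuals, one per fold, with tiny label sequences.
    toy = {f"ind{k}": ([0.0] * 4, [0, 0, 1, UNKNOWN]) for k in range(5)}
    folds = assign_folds(toy.keys())
    print(cross_validated_f1(toy, folds, train_majority, num_classes=2))
```

Splitting by individual rather than by sample keeps every test individual unseen during training, which is the property the Figure 3 caption emphasizes. Averaging per-individual scores weights each animal equally regardless of how much data it contributed.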