A benchmark for computational analysis of animal behavior, using animal-borne tags
Benjamin Hoffman, Maddie Cusimano, Vittorio Baglione, Daniela Canestrari, Damien Chevallier, Dominic L. DeSantis, Lorène Jeantet, Monique A. Ladds, Takuya Maekawa, Vicente Mata-Silva, Víctor Moreno-González, Anthony Pagano, Eva Trapote, Outi Vainio, Antti Vehkaoja, Ken Yoda, Katherine Zacarian, Ari Friedlaender
TL;DR
The paper introduces the Bio-logger Ethogram Benchmark (BEBE), the largest publicly available multi-species benchmark for analyzing animal behavior from bio-logger time-series data. It standardizes a behavior classification task across nine annotated datasets, enabling robust comparison of deep neural networks and classical ML methods, and assesses the impact of self-supervised pre-training using human accelerometer data. Across datasets, deep models, particularly CRNN and harnet, outperform traditional feature-based approaches, with self-supervised pre-training providing notable advantages in low-data settings. BEBE provides open datasets and code, furnishes concrete design guidance for ML-based ethology studies, and invites community collaboration to broaden taxonomic coverage and modeling tasks for conservation and behavioral ecology. The work advances scalable, cross-species ML evaluation for bio-logging and highlights the promise and limits of self-supervised transfer in animal behavior inference.
Abstract
Animal-borne sensors ('bio-loggers') can record a suite of kinematic and environmental data, which are used to elucidate animal ecophysiology and improve conservation efforts. Machine learning techniques are used to interpret the large amounts of data recorded by bio-loggers, but there is no common framework for comparing the different machine learning techniques in this domain. This makes it difficult, for example, to identify patterns in what works well for machine learning-based analysis of bio-logger data. It also makes it difficult to evaluate the effectiveness of novel methods developed by the machine learning community. To address this, we present the Bio-logger Ethogram Benchmark (BEBE), a collection of datasets with behavioral annotations, as well as a modeling task and evaluation metrics. BEBE is to date the largest, most taxonomically diverse, publicly available benchmark of this type. Using BEBE, we compare the performance of deep and classical machine learning methods for identifying animal behaviors based on bio-logger data. As an example usage of BEBE, we test an approach based on self-supervised learning. To apply this approach to animal behavior classification, we adapt a deep neural network pre-trained with 700,000 hours of data collected from human wrist-worn accelerometers. We find that deep neural networks outperform the classical machine learning methods we tested across all nine datasets in BEBE. We additionally find that the approach based on self-supervised learning outperforms the alternatives we tested, especially when little training data is available. In light of this, we are able to make concrete suggestions for designing studies that rely on machine learning to infer behavior from bio-logger data. Datasets and code are available at https://github.com/earthspecies/BEBE.
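
To make the behavior classification task concrete, the following is a minimal sketch of what a classical, feature-based pipeline of this general kind might look like. It is not the BEBE code or any of the models evaluated in the paper: the synthetic tri-axial accelerometer data, the per-axis mean and standard deviation features, the nearest-centroid classifier, the window length, and the labels `rest` and `move` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def window_features(samples, window, step):
    """Cut a tri-axial series into windows and summarize each one.

    Returns one feature vector per window:
    [mean_x, mean_y, mean_z, std_x, std_y, std_z].
    """
    feats = []
    for start in range(0, len(samples) - window + 1, step):
        chunk = samples[start:start + window]
        axes = list(zip(*chunk))  # regroup samples by axis
        feats.append([mean(a) for a in axes] + [pstdev(a) for a in axes])
    return feats


def fit_centroids(features, labels):
    """Average the feature vectors of each behavior label."""
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)
    return {y: [mean(col) for col in zip(*fs)] for y, fs in groups.items()}


def predict(centroids, feature):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(feature, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


# Synthetic (hypothetical) recordings: near-constant signal vs. strong oscillation.
rest = [(0.0, 0.0, 1.0 + 0.01 * ((i % 2) * 2 - 1)) for i in range(40)]
move = [(0.5 * (-1) ** i, 0.3 * (-1) ** (i + 1), 1.0 + 0.4 * (-1) ** i)
        for i in range(40)]

rest_feats = window_features(rest, window=10, step=10)
move_feats = window_features(move, window=10, step=10)

# Train on the first two windows of each behavior, test on the last two.
train_x = rest_feats[:2] + move_feats[:2]
train_y = ["rest", "rest", "move", "move"]
test_x = rest_feats[2:] + move_feats[2:]

centroids = fit_centroids(train_x, train_y)
predictions = [predict(centroids, f) for f in test_x]
print(predictions)
```

Hand-crafted summary statistics like these are the sort of input that feature-based methods depend on, whereas the deep neural networks compared in the benchmark learn their representations from the data.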
