No Free Lunch in Annotation either: An objective evaluation of foundation models for streamlining annotation in animal tracking
Emil Mededovic, Valdy Laurentius, Yuli Wu, Marcin Kopaczka, Zhu Chen, Mareike Schulz, René Tolba, Johannes Stegmaier
TL;DR
The paper investigates the reliability of foundation-model–aided annotation for long-horizon animal tracking and presents SAM-QA, a lightweight, semi-automatic workflow that fine-tunes a Segment Anything Model and enforces a quality-control loop with spatio-temporal consistency checks and SAM2-based recovery. On rat and mouse datasets, SAM-QA delivers the strongest automated-label performance among the tested methods, narrowing the gap to manual annotations without yet closing it. The findings underscore the need to combine automated annotations with targeted human oversight to maintain tracking accuracy, and they identify SAM2 video as a promising avenue for future improvement through further fine-tuning and tighter integration of quality control. Overall, the work highlights practical ways to accelerate annotation while preserving the data quality needed for robust animal-tracking models.
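To make the idea of a spatio-temporal consistency check concrete, here is a minimal sketch. It is not the SAM-QA implementation: it assumes each animal's annotations are available as one bounding box per frame, and it flags a frame for review when the box overlaps too little with the previous frame's box. The names `box_iou` and `flag_inconsistent_frames` and the threshold `min_iou` are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def flag_inconsistent_frames(track: List[Box], min_iou: float = 0.3) -> List[int]:
    """Return frame indices whose box overlaps the previous frame's box
    by less than min_iou (an assumed, illustrative threshold)."""
    return [
        t for t in range(1, len(track))
        if box_iou(track[t - 1], track[t]) < min_iou
    ]


# Illustrative track: the box jumps abruptly at frame 2.
track = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60), (51, 50, 61, 60)]
flagged = flag_inconsistent_frames(track)
print(flagged)
```

Frames flagged this way would be candidates for automatic recovery or manual correction; a real pipeline would likely operate on masks and handle occlusions and missing detections as well.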
Abstract
We analyze the capabilities of foundation models for the tedious task of generating annotations for animal tracking. Annotating a large amount of data is vital and can be a make-or-break factor for the robustness of a tracking model. Robustness is particularly crucial in animal tracking, as accurate tracking over long time horizons is essential for capturing the behavior of animals. However, generating additional annotations with foundation models can be counterproductive, because annotation quality matters just as much as quantity. Poorly annotated data introduces noise and inaccuracies that degrade the performance of the trained model. Over-reliance on automated annotations without verifying their precision can therefore worsen results, which makes careful oversight and quality control essential in the annotation process. Ultimately, we demonstrate that a thoughtful combination of automated and manually annotated data is a valuable strategy, yielding an IDF1 score of 80.8 compared with 65.6 for blind use of SAM2 video.
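For readers unfamiliar with the metric, IDF1 is the harmonic mean of identification precision and recall, computed as 2·IDTP / (2·IDTP + IDFP + IDFN) after ground-truth and predicted trajectories have been matched. The sketch below shows only this final step; the function name `idf1` and the counts are illustrative assumptions and are not taken from the paper's experiments.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 from identity-level counts.

    idtp: detections assigned to the correct identity
    idfp: predicted detections with no correctly matched identity
    idfn: ground-truth detections with no correctly matched identity
    """
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Made-up counts for illustration only.
score = idf1(idtp=80, idfp=20, idfn=20)
print(f"IDF1 = {100 * score:.1f}")
```

Because identity errors persist for every frame after a swap, IDF1 penalizes long-lived identity confusion, which is why it suits long-horizon tracking evaluation.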
