Rodent-Bench

Thomas Heap; Laurence Aitchison; Emma Cahill; Adriana Casado Rodriguez

TL;DR

Rodent-Bench targets scalable, automated annotation of rodent behavior, where manual labeling is time-consuming and current Multimodal Large Language Models (MLLMs) struggle with temporal and contextual reasoning. The authors introduce two dataset variants (Rodent-Bench-Short and Rodent-Bench-Long), a JSON-based segment-annotation task, and a multi-metric evaluation framework (second-wise accuracy, macro F1, mAP, mutual information, MCC) to systematically compare state-of-the-art MLLMs. Experiments with Gemini-2.5-Pro, Gemini-2.5-Flash, and Qwen-VL-Max reveal substantial shortfalls: grooming detection yields the strongest results, while tasks requiring precise temporal segmentation and context integration prove particularly challenging. By providing standardized prompts, schemas, and datasets, Rodent-Bench establishes a practical foundation for advancing reliable automated behavioral annotation in neuroscience research.

Abstract

We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behavior footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash, and Qwen-VL-Max, using this benchmark and find that none of these models performs strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms, including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics, including second-wise accuracy, macro F1, mean average precision, mutual information, and the Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation and provides insights for future model development. Rodent-Bench serves as a foundation for tracking progress toward reliable automated behavioral annotation in neuroscience research.
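
To make the second-wise metrics named in the abstract concrete, the following is a minimal sketch of how segment annotations could be expanded into one label per second and then scored. It is an illustration under stated assumptions, not the authors' evaluation code: the segment fields `start`, `end`, and `label`, the fallback label `"other"`, and all function names are hypothetical, and the MCC shown is the plain multiclass form rather than the weighted variant reported in the figures.

```python
import json
import math
from collections import Counter


def segments_to_seconds(segments, duration, default="other"):
    """Expand segments into one label per second.

    ASSUMPTION: each segment is {"start": int, "end": int, "label": str},
    with `end` exclusive. This schema is hypothetical.
    """
    labels = [default] * duration
    for seg in segments:
        start = max(0, int(seg["start"]))
        end = min(duration, int(seg["end"]))
        for t in range(start, end):
            labels[t] = seg["label"]
    return labels


def second_wise_accuracy(truth, pred):
    """Fraction of seconds whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def macro_f1(truth, pred):
    """Unweighted mean of per-class F1 over every class seen in either sequence."""
    classes = sorted(set(truth) | set(pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def multiclass_mcc(truth, pred):
    """Multiclass Matthews correlation coefficient (unweighted)."""
    n = len(truth)
    correct = sum(t == p for t, p in zip(truth, pred))
    true_counts = Counter(truth)
    pred_counts = Counter(pred)
    classes = set(true_counts) | set(pred_counts)
    cross = sum(true_counts[c] * pred_counts[c] for c in classes)
    denom = math.sqrt(
        (n * n - sum(v * v for v in pred_counts.values()))
        * (n * n - sum(v * v for v in true_counts.values()))
    )
    # By convention, return 0 when either sequence has a single class.
    return (correct * n - cross) / denom if denom else 0.0


# Toy 10-second clip: the prediction ends the grooming bout 2 seconds late.
truth_json = '[{"start": 0, "end": 4, "label": "grooming"}]'
pred_json = '[{"start": 0, "end": 6, "label": "grooming"}]'

truth = segments_to_seconds(json.loads(truth_json), duration=10)
pred = segments_to_seconds(json.loads(pred_json), duration=10)

accuracy = second_wise_accuracy(truth, pred)
f1 = macro_f1(truth, pred)
mcc = multiclass_mcc(truth, pred)
print(accuracy, f1, mcc)
```

In this toy case a two-second boundary error leaves accuracy and macro F1 at 0.8 while MCC drops to about 0.67, which illustrates why a benchmark might report several metrics rather than accuracy alone.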

Paper Structure (40 sections, 4 equations, 34 figures, 2 tables)

Introduction
Related Work
Rodent-Bench
Data Collection
Metrics
Experiments
Experimental Setup
Results
Limitations
Conclusion
Implementation Details
Model Access and Configuration
Video Processing Pipeline
Batch Processing Implementation
Model Specifications
...and 25 more sections

Figures (34)

Figure 1: Workflow for annotating rodent videos.
Figure 2: Performance metrics for Gemini-2.5-Pro across all datasets. Each metric shows substantial variation across behavioral paradigms, with the grooming detection dataset achieving the highest performance across most metrics. Social behaviors (CalMS21) show moderate performance, while challenging datasets like freezing and scratch detection exhibit poor performance approaching chance levels. Dashed lines indicate theoretical maximum performance where applicable. Error bars represent $2 \times$ standard error across videos within each dataset. The consistently low performance on certain datasets highlights the difficulty of fine-grained temporal behavioral annotation for current MLLMs.
Figure 3: Weighted Matthews Correlation Coefficient (MCC) performance across models. (a) Rodent-Bench-Long: Gemini-2.5-Pro achieves the highest performance with lower variance compared to Gemini-2.5-Flash. (b) Rodent-Bench-Short: Similar performance hierarchy with Gemini-2.5-Pro outperforming Gemini-2.5-Flash, while Qwen-VL-Max shows near-chance performance. Error bars represent $2 \times$ standard error across datasets. All models show modest performance levels, indicating substantial room for improvement in behavioral annotation tasks.
Figure 4: Behavior Proportions for each dataset.
Figure 5: CalMS21 behaviors.
...and 29 more figures
