
Eye on the Target: Eye Tracking Meets Rodent Tracking

Emil Mededovic, Yuli Wu, Henning Konermann, Marcin Kopaczka, Mareike Schulz, Rene Tolba, Johannes Stegmaier

TL;DR

This work tackles the bottleneck of manual annotation in rodent behavioral analysis by introducing a gaze-driven prompting pipeline that converts eye-tracking data into segmentation prompts for a fast zero-shot model. It integrates depth-aware refinement, local exploratory sampling (LES), and Kalman filtering to iteratively improve segmentation masks without retraining. Across two rodent datasets and nine participants, depth-aware refinement emerges as the most robust post-processing strategy, yielding substantial gains in Jaccard and Dice scores, particularly for rats, while LES provides strong improvements when the initial prompts are reasonable. The approach offers a scalable, annotation-efficient pathway for automated behavioral analysis with practical implications for high-throughput neuroscience studies.
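The TL;DR names Kalman filtering but does not state what it tracks or which motion model it uses. As a minimal sketch, assuming the filter smooths per-frame 2D gaze prompt points under a constant-velocity model, a predict/update loop could look like this. The function name `kalman_smooth_gaze` and the noise scales `q` and `r` are hypothetical and are not the authors' settings.

```python
import numpy as np

def kalman_smooth_gaze(points, dt=1.0, q=1.0, r=25.0):
    """Filter a sequence of 2D gaze points (pixels) with a constant-velocity
    Kalman filter. State is [x, y, vx, vy]; q and r are assumed noise scales."""
    points = np.asarray(points, dtype=float)
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # constant-velocity transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only position is observed
    Q = q * np.eye(4)                           # process noise (assumed isotropic)
    R = r * np.eye(2)                           # gaze measurement noise (assumed)
    x = np.array([points[0, 0], points[0, 1], 0.0, 0.0])
    P = np.eye(4) * 100.0                       # broad initial uncertainty
    smoothed = []
    for z in points:
        # Predict the next state from the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the observed gaze point.
        innovation = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())
    return np.array(smoothed)
```

Under this model a stationary gaze stays put and a steadily moving gaze is tracked without lag once the velocity estimate converges. Larger `r` relative to `q` trades responsiveness for smoother prompts.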

Abstract

Analyzing animal behavior from video recordings is crucial for scientific research, yet manual annotation remains labor-intensive and prone to subjectivity. Efficient segmentation methods are needed to automate this process while maintaining high accuracy. In this work, we propose a novel pipeline that uses eye-tracking data from Aria glasses to generate prompt points, which a fast zero-shot segmentation model then turns into segmentation masks. We additionally post-process the prompts to refine them, which improves segmentation quality. We demonstrate that combining eye-tracking-based annotation with smart prompt refinement enhances segmentation accuracy, raising the Jaccard index on the rats dataset from 38.8 to 66.2, a relative improvement of 70.6%.
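For reference, the two overlap metrics reported above and the relative-improvement arithmetic behind the 70.6% figure can be computed as follows. This is a generic sketch on binary masks, not the authors' evaluation code; returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np

def jaccard_index(pred, gt):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

def dice_score(pred, gt):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # same assumed convention as above
    return float(2 * np.logical_and(pred, gt).sum() / total)

def relative_improvement(before, after):
    """Percentage change relative to the baseline score."""
    return 100.0 * (after - before) / before

# Rats-dataset Jaccard values from the abstract: 38.8 -> 66.2.
print(f"{relative_improvement(38.8, 66.2):.1f}%")  # 70.6%
```

On a single mask the two metrics are linked by D = 2J / (1 + J), so they rank predictions identically; averaged over many frames they can diverge slightly because the relation is nonlinear.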

Paper Structure

This paper contains 13 sections, 19 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Aria glasses [engel2023project] integrate cameras to monitor eye movements and capture the user's field of view. The gaze estimation process consists of the following transformations: (1) Eye to Eye Tracker ($\textbf{T}_{\mathrm{EET}}$): capturing eye movements via dedicated cameras. (2) Eye Tracker to Gaze ($\textbf{T}_{\mathrm{ETG}}$): deriving the gaze direction from eye-tracker data. (3) Gaze to World ($\textbf{T}_{\mathrm{GW}}$): mapping gaze coordinates onto the world coordinate system. (4) World to Image ($\textbf{T}_{\mathrm{WI}}$): projecting world coordinates onto the image plane. The extracted gaze-based prompts serve as inputs for segmentation with EfficientSAM [xiong2024efficientsam], with optional post-processing to refine the segmentation quality.
  • Figure 2: For depth-aware refinement, we first compute the depth map with Depth-Anything v2 [yang2024depth]. We then iteratively select local maxima, surrounding each selected peak with an exclusion zone so that closely spaced neighboring peaks are not also selected and the prompts do not cluster.
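Figure 1's chain can be read as a composition of homogeneous transforms followed by a camera projection. The sketch below assumes steps (1) and (2) have already produced a gaze direction in a head-fixed gaze frame, places the gaze point at an assumed distance along that ray, and approximates step (4) with a world-to-camera extrinsic plus a simple pinhole intrinsic matrix `K`. The real Aria calibration may use a non-pinhole (fisheye) camera model, and all function and parameter names here are hypothetical.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T

def gaze_to_pixel(gaze_dir, distance, T_gw, T_wc, K):
    """Project a point on the gaze ray into the image.

    gaze_dir: 3-vector gaze direction in the gaze frame (result of steps 1-2).
    distance: assumed distance along the ray, in metres.
    T_gw:     4x4 gaze-to-world transform (step 3).
    T_wc:     4x4 world-to-camera extrinsic; together with K it stands in for step 4.
    K:        3x3 pinhole intrinsic matrix (simplifying assumption).
    Returns (u, v) in pixels, or None if the point lies behind the camera.
    """
    d = np.asarray(gaze_dir, dtype=float)
    d = d / np.linalg.norm(d)
    p_gaze = np.append(distance * d, 1.0)   # homogeneous 3D point on the ray
    p_cam = T_wc @ T_gw @ p_gaze            # gaze frame -> world -> camera
    if p_cam[2] <= 0:
        return None
    uvw = K @ p_cam[:3]                     # pinhole projection
    return float(uvw[0] / uvw[2]), float(uvw[1] / uvw[2])
```

The resulting (u, v) pixel is what would be handed to the segmentation model as a point prompt.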
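The exclusion-zone peak selection in Figure 2 behaves like greedy non-maximum suppression. A minimal sketch on a 2D depth array follows; the circular exclusion zone and the parameters `num_peaks`, `radius`, and `min_value` are assumptions for illustration, not the authors' settings. Whether maxima correspond to near or far surfaces depends on the depth model's output convention.

```python
import numpy as np

def select_depth_peaks(depth, num_peaks=3, radius=10, min_value=None):
    """Greedily pick up to num_peaks maxima from a 2D depth map, masking a
    disk of the given radius around each pick so neighbors are not reselected.
    Returns a list of (row, col) tuples."""
    work = np.asarray(depth, dtype=float).copy()
    h, w = work.shape
    yy, xx = np.mgrid[0:h, 0:w]
    peaks = []
    for _ in range(num_peaks):
        r, c = divmod(int(np.argmax(work)), w)   # current highest value
        value = work[r, c]
        # Stop when everything is suppressed or the peak is too weak.
        if not np.isfinite(value) or (min_value is not None and value < min_value):
            break
        peaks.append((r, c))
        # Exclusion zone: suppress all pixels within `radius` of the pick.
        work[(yy - r) ** 2 + (xx - c) ** 2 <= radius ** 2] = -np.inf
    return peaks
```

The function returns (row, col) pairs; point prompts for SAM-style models are usually given as (x, y), so swap the order before use.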