BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu

Abstract

Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
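
The reprojection-error check mentioned in the abstract can be illustrated with a minimal sketch. It assumes calibrated pinhole cameras given as 3x4 projection matrices; the function names, the toy camera matrices, and the 2-pixel threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def project(P, X):
    """Project a 3D point X = (x, y, z) to pixels with a 3x4 camera matrix P."""
    xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(p * x for p, x in zip(row, xh)) for row in P)
    return (u / w, v / w)

def reprojection_errors(cameras, X, observations):
    """Pixel distance between each observed 2D label and the reprojected 3D point."""
    return [math.dist(project(P, X), obs) for P, obs in zip(cameras, observations)]

def flag_low_confidence(errors, threshold_px):
    """Indices of views whose reprojection error exceeds the threshold (assumed value)."""
    return [i for i, e in enumerate(errors) if e > threshold_px]

# Toy setup (assumed): two cameras with focal length 100 px, the second shifted along x.
cameras = [
    [[100.0, 0.0, 0.0, 0.0], [0.0, 100.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[100.0, 0.0, 0.0, -100.0], [0.0, 100.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
]
point_3d = (1.0, 2.0, 10.0)                 # triangulated keypoint estimate
observations = [(10.0, 20.0), (3.0, 24.0)]  # 2D labels assigned in each view

errors = reprojection_errors(cameras, point_3d, observations)
suspect_views = flag_low_confidence(errors, threshold_px=2.0)
```

Here the label in the second view lies 5 pixels from the reprojected point, so that view is flagged for filtering or correction while the first view is kept.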

Paper Structure (22 sections, 5 figures)

Figures (5)

  • Figure 1: Overview of BehaviorVLM. This VLM- and LLM-based framework addresses pose estimation and behavioral understanding with minimal manual labeling and no finetuning.
  • Figure 2: Pose estimation experimental setup. (A) Data collection: a mouse injected with near-infrared quantum dots (QDs) at 12 body keypoints is recorded by six synchronized NIR-optimized cameras. (B) Example six-view frames with QD fluorescence centroids detected and overlaid as numbered candidates on the reflectance images. Centroid indices are local to each view; the goal is to assign anatomical identities to these candidates across all cameras and timepoints.
  • Figure 3: BehaviorVLM pose estimation pipeline and results. (A) Pipeline overview. (B) Detailed example for one frame from camera 0: the VLM first localizes four body regions (ears, back, paws, tail) via bounding boxes, then assigns centroids to keypoints within each cropped region, merges assignments, and resolves conflicts. Six-view predictions are triangulated into 3D and refined via RANSAC consensus. (C) Ablation study showing mean 3D keypoint error (mm) averaged over 12 keypoints across 500 frames. The full BehaviorVLM pipeline (6.59 mm) outperforms variants without 3D cross-view refinement (9.16 mm) and without both body region detection and 3D refinement (14.29 mm), demonstrating the contribution of each component. (D) Representative 3D keypoint trajectories for four body keypoints: back_top, tail_tip, ear_R, and hindpaw_R. Ground truth is shown in orange, BehaviorVLM predictions in blue.
  • Figure 4: Overview of the BehaviorVLM pipeline for semantic behavioral understanding. Behavioral features are first over-segmented into fine-grained candidate clips for each animal. A vision-language model (VLM) then generates natural-language labels and descriptions for each clip. The mouse A0 segments shown here are the direct VLM-stage segments and are therefore more fine-grained than the final LLM-merged mouse A0 segments shown in Figure 5.
  • Figure 5: Behavioral understanding results for video 3ZOUFPHJ7JOHFBE8RHY6 in the MABe2022 Mouse Triplets dataset. BehaviorVLM produces temporally coherent behavioral segmentation for each mouse. For every candidate segment, a vision-language model (VLM) first generates natural-language descriptions of the observed actions and interactions (Figure 4, mouse A0 segments). A large language model (LLM) then refines and merges these descriptions into the final behavioral events shown here.
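
The ablation metric in Figure 3(C), mean 3D keypoint error averaged over keypoints and frames, can be sketched as follows. This is a minimal illustration that assumes predictions and ground truth are nested lists indexed as frames, then keypoints, then (x, y, z) in millimetres; the function name and the toy values are hypothetical and do not reproduce the paper's data.

```python
import math

def mean_keypoint_error(pred, truth):
    """Mean Euclidean distance (mm) over all keypoints and frames.

    pred, truth: list of frames, each a list of (x, y, z) keypoints in mm.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    dists = []
    for pred_frame, truth_frame in zip(pred, truth):
        if len(pred_frame) != len(truth_frame):
            raise ValueError("each frame must have the same number of keypoints")
        dists.extend(math.dist(p, t) for p, t in zip(pred_frame, truth_frame))
    return sum(dists) / len(dists)

# Toy data (assumed): 2 frames x 2 keypoints.
truth = [
    [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
]
pred = [
    [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)],   # first keypoint is 5 mm off
    [(0.0, 0.0, 0.0), (10.0, 0.0, 12.0)],  # second keypoint is 12 mm off
]

error_mm = mean_keypoint_error(pred, truth)
```

In the toy data the four per-keypoint distances are 5, 0, 0, and 12 mm, so the mean is 4.25 mm. With 12 keypoints over 500 frames, as in the figure, the same average runs over 6000 distances.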