Learning When to Look: On-Demand Keypoint-Video Fusion for Animal Behavior Analysis

Weihan Li, Jingyang Ke, Yule Wang, Chengrui Li, Anqi Wu

TL;DR

LookAgain is a multimodal framework that combines the efficiency of keypoints with the representational power of video through on-demand visual grounding, enabling high-quality behavior analysis on long-duration recordings.

Abstract

Understanding animal behavior from video is essential for neuroscience research. Modern laboratories typically collect two complementary data streams: skeletal keypoints from pose estimation tools and raw video recordings. Keypoint-based methods are efficient but suffer from geometric ambiguity, environmental blindness, and sensitivity to occlusions. Video-based methods capture rich context but require processing every frame, making them impractical for the hundreds of hours of recordings that modern experiments produce. We introduce LookAgain, a multimodal framework that combines the efficiency of keypoints with the representational power of video through on-demand visual grounding. During training, LookAgain uses dense visual features to pretrain a motion encoder and to train a gating module that learns which frames require visual context. During inference, this gating module activates visual processing only when keypoint signals are ambiguous, while maintaining performance comparable to using all frames. Experiments on single-animal and multi-animal benchmarks show that LookAgain achieves strong performance with significantly reduced computational cost, enabling high-quality behavior analysis on long-duration recordings.

Paper Structure

This paper contains 37 sections, 25 equations, and 5 figures.

Figures (5)

  • Figure 1: Overview of the LookAgain framework. (A) Pretraining stage: The Motion Encoder learns to tokenize keypoint sequences and predict visual features from a frozen vision encoder (shown in gray). Four losses guide pretraining: masked keypoint prediction ($\mathcal{L}_{\text{mask}}$), cross-modal vision prediction ($\mathcal{L}_{\text{pred}}$), motion reconstruction ($\mathcal{L}_{\text{recon}}$), and Residual Vector Quantization (RVQ) commitment ($\mathcal{L}_{\text{RVQ}}$). (B) Fine-tuning stage: The Motion Encoder is frozen (shown in gray), and a gating module learns to determine when to activate visual processing at the frame level. Specifically, the gate identifies the top-$k$ most informative frames, on which visual features are extracted using a frozen vision encoder. These features are then fused with motion representations for supervised behavior classification ($\mathbf{h}_t \rightarrow \mathcal{L}_{\text{cls}}$) or unsupervised behavior segmentation ($\mathbf{f}_t \rightarrow \mathcal{L}_{\text{seg}}$). Solid lines denote forward data flow, while dashed lines indicate loss computation.
  • Figure 2: Supervised classification results on single mouse data. (A) Test F1 scores for Rearing and Grooming behaviors across different labeled data ratios. Our method outperforms the supervised Transformer and BAMS baselines, especially when labels are limited. (B) Ablation on top-$k$ settings (0%, 25%, 50%, 100%), showing that on-demand visual grounding (top-$k$=25%) achieves performance comparable to full visual processing (top-$k$=100%), and that visual information provides the largest gains when labels are scarce.
  • Figure 3: Supervised classification results on multi-animal social behavior data. (A) Per-class test F1 scores comparing our method with BEiT + SimCLR + Hand-crafted features and BAMS. For each method, we evaluate keypoint-video fusion, video-only, and keypoint-only variants where applicable. Our method consistently outperforms all baselines across behavior classes, with keypoint-video fusion achieving the best performance. Visual information benefits Chase, Huddle, and Oral-Genital Contact, but introduces noise for Oral Contact, likely because Oral Contact depends on precise geometric relationships that keypoints capture directly. (B) Ablation on top-$k$ settings (25%, 50%, 75%, 100%). F1 scores increase with more visual frames for most behaviors, except Oral Contact where keypoint features alone are more effective.
  • Figure 4: Unsupervised segmentation results on multi-animal social behavior data. (A) Test macro F1 scores comparing our method with BAMS and Keypoint-MoSeq. Our method outperforms baselines even with top-$k$=0%, and its F1 score consistently increases as additional visual information is incorporated. (B) Visualization of segmentation results. Top: keypoint trajectories for all three mice. Bottom: segmentation bars from ground truth, our method, Keypoint-MoSeq, and BAMS. Our method produces segmentation most consistent with ground truth. (C) Example clusters discovered by our method: Feeding (mouse near feeder), Oral-Genital Contact, and Oral Contact. Our method can discover environment-dependent behaviors like Feeding, which keypoint-only methods cannot detect.
  • Figure 5: Ablation studies. (A) Pretrain+supervised finetune vs. supervised finetune-only: pretraining improves performance across all behavior classes on both datasets. (B) Learned gating vs. uniform sampling: our gating mechanism outperforms random frame selection, with the largest gains at lower top-$k$ ratios. (C) Gating component analysis: all three components are necessary, with motion saliency $m_t$ being the most important.
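
The frame-level gating described in the Figure 1 caption (select the top-$k$ most informative frames, run the vision encoder only on those, then fuse with motion features) can be illustrated with a small sketch. This is not the authors' implementation: the function names `select_topk_frames` and `fuse_on_demand`, the callable `vision_encoder`, the rounding rule for $k$, and the additive fusion are all assumptions made for illustration. The captions do not specify how the gate scores are computed or how features are fused.

```python
import numpy as np


def select_topk_frames(gate_scores, ratio):
    """Return sorted indices of the top-k frames by gate score.

    `ratio` is the fraction of frames to keep (e.g. 0.25 for top-k = 25%).
    Rounding k to the nearest integer is an assumption of this sketch.
    """
    scores = np.asarray(gate_scores, dtype=float)
    num_frames = scores.shape[0]
    k = min(num_frames, int(round(ratio * num_frames)))
    if k == 0:
        return np.array([], dtype=int)
    # Stable descending sort: ties are broken in favor of earlier frames.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:k])


def fuse_on_demand(motion_feats, gate_scores, ratio, vision_encoder):
    """Fuse visual features into motion features at gated frames only.

    `vision_encoder` is a hypothetical callable mapping a frame index to a
    feature vector; it is invoked only for the selected frames. Additive
    fusion is a placeholder, not the paper's fusion operator.
    """
    selected = select_topk_frames(gate_scores, ratio)
    fused = np.array(motion_feats, dtype=float, copy=True)
    for t in selected:
        fused[t] += vision_encoder(int(t))
    return fused, selected


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8]
    motion = np.zeros((4, 2))
    fused, selected = fuse_on_demand(motion, scores, 0.5, lambda t: np.ones(2))
    print(selected)
    print(fused)
```

With a ratio of 0, no frames are selected and the output equals the keypoint-only motion features, which corresponds to the top-$k$=0% setting in the ablations. With a ratio of 1, every frame is processed visually.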