A study of animal action segmentation algorithms across supervised, unsupervised, and semi-supervised learning paradigms

Ari Blau, Evan S Schaffer, Neeli Mishra, Nathaniel J Miska, The International Brain Laboratory, Liam Paninski, Matthew R Whiteway

TL;DR

This study benchmarks animal action segmentation across supervised, unsupervised, and semi-supervised paradigms using four diverse datasets, revealing that fully supervised temporal convolutional networks with temporal features consistently achieve the best supervised metrics. The authors introduce S$^3$LDS, a semi-supervised switching linear dynamical system that leverages a small amount of labeled data within a variational framework to bridge deep inference and classical dynamical models. Across datasets, semi-supervised gains depend on the behavioral representation: temporal information in observations and inference networks boosts performance with position features, while velocity features favor purely supervised models. The work demonstrates how labeled data shape latent representations and offers a scalable codebase for semi-supervised action segmentation, with implications for refining behavioral labels and extending to semi-unsupervised discovery of novel behaviors. These findings highlight the importance of feature design and the potential of semi-supervised approaches to efficiently leverage unlabeled data in animal behavior analysis.

Abstract

Action segmentation of behavioral videos is the process of labeling each frame as belonging to one or more discrete classes, and is a crucial component of many studies that investigate animal behavior. A wide range of algorithms exist to automatically parse discrete animal behavior, encompassing supervised, unsupervised, and semi-supervised learning paradigms. These algorithms -- which include tree-based models, deep neural networks, and graphical models -- differ widely in their structure and assumptions on the data. Using four datasets spanning multiple species -- fly, mouse, and human -- we systematically study how the outputs of these various algorithms align with manually annotated behaviors of interest. Along the way, we introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models. We find that fully supervised temporal convolutional networks with the addition of temporal information in the observations perform the best on our supervised metrics across all datasets.

Paper Structure

This paper contains 39 sections, 26 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Overview of the action segmentation pipeline. Raw sensor data (e.g. video) is collected, then features are extracted (e.g. pose estimates), then an action segmentation model is trained to map those features to a discrete behavioral class for each frame.
  • Figure 2: Overview of action segmentation models. A: Top: Graphical model for supervised classification. Both discrete states $y_t$ and poses $\mathbf{x}_t$ are observed. Bottom: Inference network for the supervised model. We use a window of observed behavioral features for state prediction. B: Top: Graphical model for an unsupervised recurrent switching dynamical system. The set of discrete states $\{y_t\}$ and continuous latents $\{\mathbf{z}_t\}$ are unobserved. Bottom: The inference network uses a window of observed behavioral features to create a deterministic hidden representation $\mathbf{h}_t$ (purple arrows); this is then used to predict the continuous latents $\mathbf{z}_t$ (blue arrows) and discrete latents $y_t$ (red arrows). Note that the purple and red arrows together define a classifier for the discrete state at each time step. C: Graphical model and inference network for a semi-supervised recurrent switching dynamical system. A subset of the discrete states is observed. During inference, the observed discrete state is used for the inference of $\mathbf{z}_t$ when possible.
  • Figure 3: Supervised vs semi-supervised results for the head-fixed fly. A: Example frame of the fly, overlaid with pose markers. B: Proportion of each labeled behavior in the training dataset. C: Sample of ground truth labels, along with predictions from both the TCN and the S$^3$LDS models. Below is a subset of the corresponding features used as inputs to the models. D: F1 scores for the TCN and S$^3$LDS models. We show results for the position features (solid lines) as well as the position-velocity features (dashed lines). Adding velocity improves performance for both models. The number of unlabeled frames used in the models with the smallest number of labeled frames is displayed in the upper right corner of the graph; this number decreases as we add labels for each consecutive set of models. Error bars represent the standard deviation of the F1 scores over five subsamples of the training data. E: Confusion matrices for the TCN and S$^3$LDS models. F: Average entropy of the false positives (left) and true positives (right) for both models. Entropy results for the other datasets are shown in Fig. \ref{fig:results_super_others_detail}. Panels E and F show results from the models trained on all labeled frames with position-velocity features.
  • Figure 4: Supervised vs semi-supervised results across datasets. Conventions as in Fig. 3. As in the head-fixed fly, we find that using position-velocity features improves performance over the position features across both model types, and in all datasets the TCN performs best. A: Results on the freely moving mouse dataset. Rather than using the raw poses, we compute the features introduced in Sturman et al. (2020). These features are transformations of the poses, including distances and angles between different groups of keypoints. B: Results on the head-fixed mouse dataset. C: Results on the HuGaDB dataset. The data are collected from sensors that already contain velocity data, so we only use one set of features.
  • Figure 5: Supervised and semi-supervised latent spaces more closely align with labels than unsupervised latents (head-fixed fly). All models use position-velocity features and all available training videos for the head-fixed fly dataset. A: The top row shows a segment of ground truth labels. The following two rows show predictions from the TCN and S$^3$LDS models. The next row shows the state outputs of keypoint-MoSeq (KPM), aligned to the ground truth class with highest overlap on the training data. The final row shows the raw state outputs of keypoint-MoSeq. B: F1 scores for the TCN, S$^3$LDS and KPM models. Error bars represent the standard deviation of the F1 scores over five trained models (different initialization seeds). C: 2D UMAP embedding of continuous latents colored by discrete labels for three different models. D: The addition of hand labels produces more homogeneous clusters in the models' latent spaces. Error bars represent the standard deviation of the cluster scores over five models. We use a range of cluster numbers to show that cluster scores are not biased by cluster size.
  • ...and 13 more figures
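
Two operations mentioned in the captions above can be illustrated with a minimal sketch: building position-velocity features (Figure 3D) and aligning unsupervised states to the ground truth class with highest overlap on the training data before scoring with F1 (Figure 5A, B). This is not the authors' code. The function names, the finite-difference velocity, the zero velocity assigned to the first frame, and the macro-averaged F1 are all assumptions made for illustration; the captions do not specify these details.

```python
from collections import Counter, defaultdict


def add_velocity(positions):
    """Append finite-difference velocities to each frame's position features.

    Assumption: velocity is the frame-to-frame difference, and the first
    frame is given zero velocity.
    """
    out = []
    for t, frame in enumerate(positions):
        prev = positions[t - 1] if t > 0 else frame
        velocity = [cur - old for cur, old in zip(frame, prev)]
        out.append(list(frame) + velocity)
    return out


def map_states_to_classes(states, labels):
    """Map each unsupervised state to the labeled class it overlaps most."""
    counts = defaultdict(Counter)
    for state, label in zip(states, labels):
        counts[state][label] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the averaging scheme is an assumption)."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): fit the state-to-class mapping on training frames,
# then apply it to held-out frames and score against their labels.
train_states = [0, 0, 1, 1, 1, 2, 2, 0]
train_labels = ["still", "still", "walk", "walk", "groom", "groom", "groom", "still"]
mapping = map_states_to_classes(train_states, train_labels)

test_states = [0, 1, 1, 2]
test_labels = ["still", "walk", "groom", "groom"]
# States never seen during training map to None and count as errors.
test_pred = [mapping.get(s) for s in test_states]
score = macro_f1(test_labels, test_pred)

features = add_velocity([[0.0, 1.0], [1.0, 3.0]])
```

Fitting the mapping on training frames only, as the Figure 5 caption describes, keeps the held-out F1 from being inflated by a mapping tuned to the test labels.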