On Model and Data Scaling for Skeleton-based Self-Supervised Gait Recognition

Adrian Cosma, Andy Cǎtrunǎ, Emilian Rǎdoi

TL;DR

This work conducts the first empirical scaling study of skeleton-based self-supervised gait recognition, quantifying the effect of data quantity, model size, and compute on downstream gait recognition performance. It demonstrates predictable power-law improvements in performance with increased scale.

Abstract

Gait recognition from video streams is a challenging problem in computer vision biometrics due to the subtle differences between gaits and numerous confounding factors. Recent advancements in self-supervised pretraining have led to the development of robust gait recognition models that are invariant to walking covariates. While neural scaling laws have transformed model development in other domains by linking performance to data, model size, and compute, their applicability to gait remains unexplored. In this work, we conduct the first empirical scaling study of skeleton-based self-supervised gait recognition to quantify the effect of data quantity, model size, and compute on downstream gait recognition performance. We pretrain multiple variants of GaitPT, a transformer-based architecture, on a dataset of 2.7 million walking sequences collected in the wild. We evaluate zero-shot performance across four benchmark datasets to derive scaling laws for data, model size, and compute. Our findings demonstrate predictable power-law improvements in performance with increased scale and confirm that data and compute scaling significantly influence downstream accuracy. We further isolate architectural contributions by comparing GaitPT with GaitFormer under controlled compute budgets. These results provide practical insights into resource allocation and performance estimation for real-world gait recognition systems.

Paper Structure

This paper contains 13 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: We trained multiple scales of skeleton-based gait recognition models in a self-supervised learning regime on a dataset of 2.7M walking sequences and analysed scaling trends in terms of model size, dataset size, and compute allocation.
  • Figure 2: Selected snapshots of different camera feeds used in our dataset annotated with skeleton sequences extracted using a pretrained multi-person pose estimation model. Street webcams in populated areas enable fast and large-scale extraction of gait data.
  • Figure 3: Scaling trends for increasing the model size by parameter count, across multiple dataset sizes. We compute scaling trends only on the data points marked with a "$\bullet$" symbol, while the "$\star$" data point is used for validation. Increasing the parameter count yields a predictable positive increase in performance.
  • Figure 4: Scaling trends for increasing the dataset size in terms of the number of skeleton sequences, across multiple model sizes. We compute scaling curves from points marked "$\bullet$". The point marked with "$\star$" is used for validation. Increasing the dataset size yields a predictable positive increase in performance.
  • Figure 5: Comparison of data scaling behaviour between models trained on high quality samples versus models trained on samples from the original set.
  • ...and 5 more figures
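
The captions for Figures 3 and 4 describe fitting a scaling trend on one set of points and checking it against a held-out validation point. A minimal sketch of that idea follows, assuming a pure power law y = a * x^b fitted by least squares in log-log space. The functional form, the error metric, and every number below are illustrative assumptions, not values or methods taken from the paper; the helper names `fit_power_law` and `predict_power_law` are hypothetical.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y ~ a * x**b by linear least squares on (log x, log y).

    Returns (a, b). Assumes all x and y are strictly positive.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)


def predict_power_law(a, b, x):
    """Evaluate the fitted power law a * x**b at x."""
    return a * np.asarray(x, dtype=float) ** b


# Synthetic, hypothetical data (NOT results from the paper):
# parameter counts and an error metric that follows an exact power law.
fit_sizes = np.array([1e6, 2e6, 4e6, 8e6])   # points used for the fit
fit_errors = 5.0 * fit_sizes ** -0.12

held_out_size = 1.6e7                         # point reserved for validation
held_out_error = 5.0 * held_out_size ** -0.12

a, b = fit_power_law(fit_sizes, fit_errors)
predicted = float(predict_power_law(a, b, held_out_size))

# Relative gap between the extrapolated and the held-out value.
relative_gap = abs(predicted - held_out_error) / held_out_error
print(f"a={a:.3f}, b={b:.3f}, relative gap={relative_gap:.2e}")
```

With real measurements the held-out gap would not be near zero; its size gives a rough sense of how far the fitted trend can be trusted when extrapolating to a larger model or dataset.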