The Paradox of Motion: Evidence for Spurious Correlations in Skeleton-based Gait Recognition Models

Andy Cătrună, Adrian Cosma, Emilian Rădoi

TL;DR

The paper investigates whether skeleton-based gait recognition models rely on motion patterns or instead exploit appearance and anthropometric information embedded in pose data. It uses normalization-based ablations to suppress height and screen-position cues across multiple architectures, and introduces a single-pose spatial transformer to test appearance-only recognition; both are evaluated on CASIA-B and GREW. The key findings are that removing height and position cues degrades performance on the controlled benchmark, that a single pose can achieve notable accuracy, and that in-the-wild data (GREW) is less prone to such shortcuts. Together these results underscore the need to disentangle motion from appearance and to curate diverse datasets. The work highlights privacy concerns and methodological biases in current benchmarks, advocating for robust, diverse gait datasets and a balanced use of appearance versus motion cues in practical gait analysis.

Abstract

Gait, an unobtrusive biometric, is valued for its capability to identify individuals at a distance, across external outfits and environmental conditions. This study challenges the prevailing assumption that vision-based gait recognition, in particular skeleton-based gait recognition, relies primarily on motion patterns, revealing a significant role of the implicit anthropometric information encoded in the walking sequence. We show through a comparative analysis that removing height information leads to notable performance degradation across three models and two benchmarks (CASIA-B and GREW). Furthermore, we propose a spatial transformer model processing individual poses, disregarding any temporal information, which achieves unreasonably good accuracy, emphasizing the bias towards appearance information and indicating spurious correlations in existing benchmarks. These findings underscore the need for a nuanced understanding of the interplay between motion and appearance in vision-based gait recognition, prompting a reevaluation of the methodological assumptions in this field. Our experiments indicate that "in-the-wild" datasets are less prone to spurious correlations, prompting the need for more diverse and large-scale datasets for advancing the field.

Paper Structure (8 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Examples of common modalities used for person recognition in biometrics literature ordered by the amount of appearance information present in each one. Intuitively, sequences of 2D skeletons extracted with pose estimation models contain the least amount of appearance features. However, our experiments show that body proportions play a significant role in automatic recognition.
  • Figure 2: Illustrative example of the effects of different normalization schemes on a skeleton gait sequence. Translations remove the position on screen, while scaling removes some height information. Our experiments show that both procedures have a definite impact on downstream gait performance.
  • Figure 3: The proposed architecture for extracting discriminative appearance information from a single pose. Our model computes self-attention at multiple levels to obtain the appearance representation of the skeleton.
  • Figure 4: Average testing accuracy for all normalization techniques during training on CASIA-B.
  • Figure 5: Rank-1 accuracy for all normalization techniques during training on GREW.
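
The two normalization schemes described for Figure 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of the joint centroid as the translation reference, and the use of the per-frame vertical extent as a height proxy are all assumptions made for illustration. The paper may normalize relative to a specific joint or over the whole sequence.

```python
from typing import List, Tuple

# One frame of a 2D skeleton: an (x, y) pixel coordinate per joint.
Pose = List[Tuple[float, float]]


def translate(pose: Pose) -> Pose:
    """Remove screen position by centering the pose on its joint centroid.

    Assumption: the centroid is the reference point; a fixed joint
    (for example the pelvis) would work the same way.
    """
    cx = sum(x for x, _ in pose) / len(pose)
    cy = sum(y for _, y in pose) / len(pose)
    return [(x - cx, y - cy) for x, y in pose]


def scale(pose: Pose, eps: float = 1e-8) -> Pose:
    """Remove some height information by dividing by the vertical extent.

    Assumption: the distance between the highest and lowest joint in the
    frame stands in for the subject's apparent height.
    """
    ys = [y for _, y in pose]
    height = max(ys) - min(ys)
    return [(x / (height + eps), y / (height + eps)) for x, y in pose]


def normalize_sequence(
    seq: List[Pose], do_translate: bool = True, do_scale: bool = True
) -> List[Pose]:
    """Apply the selected normalizations to every frame of a gait sequence."""
    out = []
    for pose in seq:
        if do_translate:
            pose = translate(pose)
        if do_scale:
            pose = scale(pose)
        out.append(pose)
    return out


# Two toy 4-joint skeletons with the same proportions, but different
# apparent height and screen position.
tall = [(100.0, 50.0), (100.0, 150.0), (90.0, 250.0), (110.0, 250.0)]
short = [(300.0, 400.0), (300.0, 450.0), (295.0, 500.0), (305.0, 500.0)]

norm_tall = normalize_sequence([tall])[0]
norm_short = normalize_sequence([short])[0]
```

After both steps the two toy skeletons become numerically indistinguishable, which is the point of the ablation: a model trained on such inputs can no longer use apparent height or position on screen as an identity cue. Note that limb-length ratios survive this normalization, consistent with the caption's wording that scaling removes only some height information.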