Gait Recognition from Highly Compressed Videos
Andrei Niculae, Andy Catruna, Adrian Cosma, Daniel Rosner, Emilian Radoi
TL;DR
This work tackles gait recognition from highly compressed surveillance videos by decoupling artifact mitigation from pose estimation. It introduces a two-stage pipeline: automatic generation of degraded training data using high-quality PsyMo videos with ground-truth poses from ViTPose, and a task-adapted artifact-correction module (FBCNN) trained with a frozen HRNet pose estimator to maximize pose accuracy through the loss $L_{AC} = \sum_V \sum_{i=1}^{N} |\hat{p}^*_i - p_i|$. The approach yields superior pose estimation on degraded footage (e.g., AP up to $0.956$) while preserving performance on high-quality data, and it improves downstream gait recognition accuracy compared with direct fine-tuning methods. The results demonstrate that artifact correction guided by a downstream task can reliably improve gait analysis in real-world, low-quality surveillance settings, with potential applicability to broader in-the-wild datasets like DenseGait.
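The loss above can be illustrated with a minimal sketch in plain Python. It rests on assumptions not stated in the summary: each pose is a list of $N$ keypoints given as `(x, y)` pairs, $|\cdot|$ is read as the coordinate-wise absolute difference, and the outer sum over $V$ is taken over a flat collection of per-frame poses. The function names `pose_l1` and `artifact_correction_loss` are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence, Tuple

Keypoint = Tuple[float, float]   # assumed (x, y) layout
Pose = Sequence[Keypoint]        # N keypoints for one frame


def pose_l1(pred: Pose, target: Pose) -> float:
    """Inner sum: absolute coordinate differences over the N keypoints."""
    if len(pred) != len(target):
        raise ValueError("poses must have the same number of keypoints")
    return sum(
        abs(px - tx) + abs(py - ty)
        for (px, py), (tx, ty) in zip(pred, target)
    )


def artifact_correction_loss(
    pred_poses: Sequence[Pose], target_poses: Sequence[Pose]
) -> float:
    """Outer sum: accumulate the per-pose L1 error over all poses."""
    if len(pred_poses) != len(target_poses):
        raise ValueError("need one target pose per predicted pose")
    return sum(pose_l1(p, t) for p, t in zip(pred_poses, target_poses))
```

In a training setup like the one described, the predicted poses would come from the frozen pose estimator applied to artifact-corrected frames, so gradients would update only the correction module. This sketch covers only the arithmetic of the loss, not that training loop.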
Abstract
Surveillance footage is a valuable resource for gait analysis. However, the typically low quality and high noise levels of such footage can severely reduce the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy is to fine-tune pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining low-quality videos annotated with poses, which we use to train the artifact correction model. We systematically evaluate our artifact correction model on a range of noisy surveillance data and demonstrate that our approach not only improves pose estimation on low-quality surveillance footage, but also preserves pose estimation accuracy on high-resolution footage. Our experiments show a clear improvement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
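The automatic annotation idea can be sketched as follows: pose labels are taken from the clean footage and then attached to a degraded copy of the same frame. This is a simplified illustration under stated assumptions. It works frame by frame, whereas the described method degrades whole videos, and `estimate_pose` and `degrade` are hypothetical callables standing in for a pose estimator and a compression step, not real library functions.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Pose = TypeVar("Pose")


def build_training_pairs(
    clean_frames: Iterable[Frame],
    estimate_pose: Callable[[Frame], Pose],  # hypothetical: run on clean input
    degrade: Callable[[Frame], Frame],       # hypothetical: e.g. heavy compression
) -> List[Tuple[Frame, Pose]]:
    """Pair each degraded frame with the pose estimated on its clean source."""
    pairs = []
    for frame in clean_frames:
        pose = estimate_pose(frame)          # label comes from the clean frame
        pairs.append((degrade(frame), pose))
    return pairs
```

Because the labels are computed before degradation, no manual annotation of the low-quality footage is needed.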
