Gait Recognition from Highly Compressed Videos

Andrei Niculae, Andy Catruna, Adrian Cosma, Daniel Rosner, Emilian Radoi

TL;DR

This work tackles gait recognition from highly compressed surveillance videos by decoupling artifact mitigation from pose estimation. It introduces a two-stage pipeline: automatic generation of degraded training data using high-quality PsyMo videos with ground-truth poses from ViTPose, and a task-adapted artifact-correction module (FBCNN) trained with a frozen HRNet pose estimator to maximize pose accuracy through the loss $L_{AC} = \sum_V \sum_{i=1}^{N} |\hat{p}^*_i - p_i|$. The approach yields superior pose estimation on degraded footage (e.g., AP up to $0.956$) while preserving performance on high-quality data, and it improves downstream gait recognition accuracy compared with direct fine-tuning methods. The results demonstrate that artifact correction guided by a downstream task can reliably improve gait analysis in real-world, low-quality surveillance settings, with potential applicability to broader in-the-wild datasets like DenseGait.
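The loss above can be read as a sum of absolute differences between the poses predicted on the corrected frames, $\hat{p}^*_i$, and the reference poses, $p_i$, accumulated over all videos $V$. The snippet below is a minimal sketch of that reading, not the authors' implementation. It assumes each pose item is an (x, y) pair and that $|\cdot|$ is the L1 distance over coordinates; the summary does not say whether $N$ counts joints or frames, and the function name `artifact_correction_loss` is hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def artifact_correction_loss(
    predicted: Sequence[Sequence[Point]],
    reference: Sequence[Sequence[Point]],
) -> float:
    """Sum of absolute coordinate differences over all videos and items.

    predicted[v][i] stands for the pose item estimated on the corrected
    frame; reference[v][i] stands for the matching ground-truth item.
    Using (x, y) pairs and the L1 distance is an assumption of this sketch.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must cover the same videos")

    total = 0.0
    for pred_video, ref_video in zip(predicted, reference):
        if len(pred_video) != len(ref_video):
            raise ValueError("each video needs the same number of items")
        for (px, py), (rx, ry) in zip(pred_video, ref_video):
            total += abs(px - rx) + abs(py - ry)
    return total
```

In the pipeline described here, only the artifact-correction module would receive gradients from this quantity, since the pose estimator is kept frozen.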

Abstract

Surveillance footage is a valuable resource for gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining pose-annotated low-quality videos, which we use to train the artifact correction model. We systematically evaluate the performance of our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only achieves improved pose estimation on low-quality surveillance footage, but also preserves the integrity of the pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
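The automatic data-generation step can be illustrated with a small sketch: re-encode a high-quality clip with H.264 at an aggressive quality setting, while keeping the poses extracted from the original clip as labels. The example below only builds an `ffmpeg` command line and does not run it. The CRF value, the file names, and the helper name `build_degrade_command` are assumptions made for illustration and are not the settings used in the paper.

```python
from typing import List


def build_degrade_command(src: str, dst: str, crf: int = 45) -> List[str]:
    """Build an ffmpeg command that re-encodes `src` with heavy H.264 compression.

    A higher CRF means stronger compression and more visible artifacts.
    libx264 accepts CRF values from 0 to 51 for 8-bit video; the default
    of 45 here is an illustrative guess, not a value from the paper.
    """
    if not 0 <= crf <= 51:
        raise ValueError("crf must be between 0 and 51 for 8-bit libx264")
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it exists
        "-i", src,          # high-quality input clip
        "-c:v", "libx264",  # H.264 encoder
        "-crf", str(crf),   # constant rate factor (quality knob)
        "-an",              # drop audio, which is irrelevant for pose estimation
        dst,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `ffmpeg` is installed. Because the degraded clip keeps the same frames as its source, poses estimated on the source can be reused as labels without manual annotation.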

Paper Structure (10 sections, 1 equation, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Examples of highly inaccurate pose estimations under severe video degradation (second row), compared to ground truth poses (first row). Artifacts introduced by typical video compression methods hinder the performance of state-of-the-art pose estimation models (Sun et al., 2019).
  • Figure 2: Overall diagram of our method for training an artifact correction model without any manual labels. We run a pose estimation model on high-definition video frames to obtain robust ground truth poses. The quality of the videos is then heavily reduced with H.264 compression to simulate real-world environments. We train an artifact correction model to alter the image so that a separate frozen pose estimation model obtains poses close to the ground truth.
  • Figure 3: Examples of poses extracted from highly degraded videos with the 3 experimental approaches on the test set. First row - Ground Truth poses; Second row - poses obtained with the original pre-trained HRNet model; Third row - poses obtained with fine-tuned HRNet; Fourth row - poses obtained with the artifact correction model (FBCNN) in combination with the original HRNet model. The proposed method obtains the most accurate poses, closely resembling those from the ground truth.
  • Figure 4: Histograms of L2 distances from ground truth of test-set poses. a) poses obtained with pre-trained HRNet (mean: 173.32); b) poses obtained with fine-tuned HRNet (mean: 107.53); c) poses obtained with FBCNN + pre-trained HRNet (mean: 79.13).
  • Figure 5: Examples of corrections made by the fine-tuned FBCNN model on degraded videos. The model learns to obscure the background with the purpose of better highlighting the human body and its joints.
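The mean L2 distances quoted for Figure 4 can be illustrated with a short sketch. The captions do not state how the distance is aggregated over joints, so the code below assumes one common choice: the Euclidean norm of the difference between two poses, taken over all joint coordinates at once. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pose_l2_distance(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Euclidean distance between two poses, treating all joints as one vector."""
    if len(pred) != len(truth):
        raise ValueError("poses must have the same number of joints")
    squared = sum((px - tx) ** 2 + (py - ty) ** 2
                  for (px, py), (tx, ty) in zip(pred, truth))
    return math.sqrt(squared)


def mean_pose_distance(
    preds: Sequence[Sequence[Point]],
    truths: Sequence[Sequence[Point]],
) -> float:
    """Average per-pose L2 distance over a test set (assumed aggregation)."""
    if not preds or len(preds) != len(truths):
        raise ValueError("need equally sized, non-empty pose lists")
    distances = [pose_l2_distance(p, t) for p, t in zip(preds, truths)]
    return sum(distances) / len(distances)
```

Under this reading, a lower mean indicates that predicted poses lie closer to the ground truth, which matches the ordering reported across panels a) to c).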