Quantitative Gait Analysis from Single RGB Videos Using a Dual-Input Transformer-Based Network

Hiep Dinh, Son Le, My Than, Minh Ho, Nicolas Vuillerme, Hieu Pham

TL;DR

This paper addresses the need for accessible, quantitative gait analysis from standard RGB video by introducing a dual-input convolutional Transformer (DPG) that processes two skeletal-pattern images derived from OpenPose to regress gait metrics such as the gait deviation index (GDI), knee flexion angle, step length, and cadence. The model uses two parallel CNN branches, concatenates their features into a large latent vector, and passes that vector through fully connected layers to output a single gait parameter, achieving competitive MAE on a cerebral palsy dataset while reducing resource requirements. The study reports improved accuracy for GDI and knee flexion over state-of-the-art methods, with some limitations on cadence, and provides open-source code and trained models to foster broader adoption. Overall, the approach enables markerless, single-view gait analysis suitable for resource-constrained settings and telehealth, potentially broadening clinical access to quantitative gait metrics.

Abstract

Gait and movement analysis has become a well-established clinical tool for diagnosing health conditions, monitoring disease progression across a wide spectrum of diseases, and implementing and assessing treatment, surgery, and/or rehabilitation interventions. However, quantitative motion assessment remains limited to costly motion capture systems and specialized personnel, restricting its accessibility and broader application. Recent advancements in deep neural networks have enabled quantitative movement analysis using single-camera videos, offering an accessible alternative to conventional motion capture systems. In this paper, we present an efficient approach for clinical gait analysis through a dual-pattern input convolutional Transformer network. The proposed system leverages a dual-input Transformer model to estimate essential gait parameters from single RGB videos captured by a single-view camera. The system demonstrates high accuracy in estimating critical metrics such as the gait deviation index (GDI), knee flexion angle, step length, and walking cadence, validated on a dataset of individuals with movement disorders. Notably, our approach surpasses state-of-the-art methods in various scenarios, using fewer resources and proving highly suitable for clinical application, particularly in resource-constrained environments.

Paper Structure

This paper contains 10 sections, 5 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Motivation for the proposed approach. Rather than relying on an ultra-expensive motion capture system within the current clinical workflow, we propose capturing motion data with a single standard mobile camera. Using the OpenPose algorithm [cao2017realtime], keypoint trajectories are extracted from sagittal-plane video, then connected and fed into a dual-input Transformer network [vaswani2017attention] to derive clinically relevant metrics such as GDI, knee flexion angle, step length, and walking cadence. The figure is reused from Kidzinski et al. [kidzinski2020deep] for illustration purposes.
  • Figure 2: An overview of our proposed DPG model. The model is designed to process skeletal sequence data extracted from the original videos using OpenPose. Each skeletal sequence is represented by two images: a lower-body landmark coordination pattern image and a lower-body landmark coordination image. The model architecture consists of a primary 3-layer convolutional block, each layer containing a convolution and a max-pooling component, to extract spatial features. Finally, a fully connected network with four layers successively reduces the data vector size from 65536 to 512, 256, 128, and 64, outputting a single prediction vector of size $1 \times 1$.
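
To make the size of the fully connected head in Figure 2 concrete, the following minimal sketch counts the trainable parameters implied by the layer widths in the caption. It is an illustration under stated assumptions, not the authors' implementation: each layer is assumed to be a plain dense layer with a bias term, and the final 64-to-1 projection is assumed, since the caption does not say how the single output value is produced. The names `HEAD_WIDTHS`, `dense_param_count`, and `head_param_counts` are hypothetical.

```python
# Layer widths taken from the Figure 2 caption. The trailing 1 (a final
# 64 -> 1 projection) is an assumption about how the scalar output is formed.
HEAD_WIDTHS = [65536, 512, 256, 128, 64, 1]


def dense_param_count(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer (bias assumed)."""
    return n_in * n_out + n_out


def head_param_counts(widths):
    """Per-layer parameter counts for each consecutive pair of widths."""
    return [dense_param_count(a, b) for a, b in zip(widths, widths[1:])]


counts = head_param_counts(HEAD_WIDTHS)
total = sum(counts)

# Print one line per layer, then the total and the first layer's share.
for (n_in, n_out), n_params in zip(zip(HEAD_WIDTHS, HEAD_WIDTHS[1:]), counts):
    print(f"{n_in:>6} -> {n_out:<4} {n_params:>12,} parameters")
print(f"total: {total:,}")
print(f"share of first layer: {counts[0] / total:.1%}")
```

Under these assumptions, the first 65536-to-512 layer accounts for more than 99% of the head's parameters, so the width of the concatenated latent vector is the main driver of the head's memory footprint.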