Learning to Estimate Critical Gait Parameters from Single-View RGB Videos with Transformer-Based Attention Network

Quoc Hung T. Le; Hieu H. Pham

TL;DR

Empirical evaluations on a public dataset of cerebral palsy patients indicate that the proposed framework surpasses current state-of-the-art approaches and shows significant improvements in predicting general gait parameters, while using fewer model parameters and alleviating the need for manual feature extraction.

Abstract

Musculoskeletal diseases and cognitive impairments cause movement difficulties in patients and negatively affect their psychological health. Clinical gait analysis, a vital tool for early diagnosis and treatment, traditionally relies on expensive optical motion capture systems. Recent advances in computer vision and deep learning have opened the door to more accessible and cost-effective alternatives. This paper introduces a novel spatio-temporal Transformer network to estimate critical gait parameters from RGB videos captured by a single-view camera. Empirical evaluations on a public dataset of cerebral palsy patients indicate that the proposed framework surpasses current state-of-the-art approaches and shows significant improvements in predicting general gait parameters (including Walking Speed, Gait Deviation Index (GDI), and Knee Flexion Angle at Maximum Extension), while using fewer model parameters and alleviating the need for manual feature extraction.

Paper Structure (12 sections, 3 equations, 3 figures, 1 table)

This paper contains 12 sections, 3 equations, 3 figures, 1 table.

Introduction
Related Works
Methodology
Problem Formulation
Network Architecture
Experiments
Datasets and Experimental Settings
Implementation Details
Experimental Results
Discussion and Conclusion
Acknowledgments
Compliance with Ethical Standards

Figures (3)

Figure 1: Overview of our approach for $T = 4$. We first project the 2D coordinates of each joint to a $D$-dimensional space. Our architecture has two attention blocks, a spatial attention block and a temporal attention block, both adopting multi-head attention (Vaswani et al., 2017; Dosovitskiy et al., 2020). The spatial attention block extracts spatial information by letting each joint attend to every other joint in the same frame. The temporal attention block captures temporal dependencies among the frames of a motion sequence. Lastly, a fully connected neural network outputs the final parameters.
Figure 2: Comparison of the number of parameters of 1D-CNN and the proposed Spatio-Temporal Transformer network.
Figure 3: Visualizing the attention matrix $\mathbf{A}$ of different heads in the spatial attention block at timestep $t=46$ (top) and in the temporal attention block from timestep $t=60$ to $t=105$ (bottom) for GDI measurement. Entry $\mathbf{A}_{i,j}$ denotes the attention weight between joint $i$ and joint $j$, or between frame $i$ and frame $j$.
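
The pipeline described in the Figure 1 caption (joint embedding, spatial attention within a frame, temporal attention across frames, and a fully connected output) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head instead of multi-head attention, random untrained weights, and mean pooling between stages, and the joint count `J`, embedding size `D`, and output size `n_out` are hypothetical values chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x has shape (..., N, D). Returns the attended tokens (..., N, D)
    and the attention matrix A of shape (..., N, N), where A[i, j]
    is the weight token i places on token j.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
T = 4        # frames, as in Figure 1
J = 17       # joints per frame (assumption)
D = 32       # embedding dimension (assumption)
n_out = 3    # number of predicted gait parameters (assumption)

def rand_proj():
    # Random, untrained projection matrix used as a stand-in for learned weights.
    return rng.normal(size=(D, D)) / np.sqrt(D)

# 2D joint coordinates for each frame, projected to a D-dimensional space.
keypoints = rng.normal(size=(T, J, 2))
w_embed = rng.normal(size=(2, D)) * 0.1
tokens = keypoints @ w_embed                       # (T, J, D)

# Spatial attention: each joint attends to every joint in the same frame.
spatial_out, spatial_attn = self_attention(
    tokens, rand_proj(), rand_proj(), rand_proj()
)                                                  # (T, J, D), (T, J, J)

# Pool joints into one token per frame (mean pooling is an assumption).
frame_tokens = spatial_out.mean(axis=1)            # (T, D)

# Temporal attention: each frame attends to every frame in the sequence.
temporal_out, temporal_attn = self_attention(
    frame_tokens, rand_proj(), rand_proj(), rand_proj()
)                                                  # (T, D), (T, T)

# Fully connected output layer mapping the pooled sequence to the parameters.
w_head = rng.normal(size=(D, n_out)) * 0.1
prediction = temporal_out.mean(axis=0) @ w_head    # (n_out,)
```

In this sketch, `spatial_attn[t]` and `temporal_attn` play the role of the matrix $\mathbf{A}$ shown in Figure 3: each row sums to one and gives the weights one joint (or frame) assigns to all the others.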
