MA^2: A Self-Supervised and Motion Augmenting Autoencoder for Gait-Based Automatic Disease Detection

Yiqun Liu, Ke Zhang, Yin Zhu

TL;DR

MA2 is a self-supervised, motion-augmenting autoencoder based on ground reaction force (GRF) that models the automatic disease detection (ADD) task as an encoder-decoder paradigm. It achieves SOTA accuracy of 90.91% when only 1% of the pathological GRF samples are labeled, and it generalizes well to a scalable Parkinson's disease dataset.

Abstract

Ground reaction force (GRF) is the force exerted by the ground on a body in contact with it. GRF-based automatic disease detection (ADD) is an emerging medical diagnosis method that uses deep learning to learn and identify the disease patterns corresponding to different gait pressures. Although existing ADD methods save doctors time in making diagnoses, training deep models remains costly because it requires labeling a large amount of gait diagnostic data from subjects. In addition, the accuracy of deep models on the unified benchmark GRF dataset and their generalization ability on scalable gait datasets need further improvement. To address these issues, we propose MA2, a GRF-based self-supervised and motion-augmenting autoencoder that models the ADD task as an encoder-decoder paradigm. In the encoder, we introduce an embedding block consisting of a 3-layer 1D convolution that extracts tokens and a mask generator that randomly masks out part of the token sequence, which maximizes the model's potential to capture high-level, discriminative, intrinsic representations. The decoder then uses this information to reconstruct the pixel sequence of the original input and computes the reconstruction loss to optimize the network. Moreover, the backbone of the autoencoder is multi-head self-attention, which considers the global information of the input tokens rather than only the local neighborhood. This allows the model to capture generalized contextual information. Extensive experiments demonstrate that MA2 achieves SOTA accuracy of 90.91% when only 1% of the pathological GRF samples are labeled, and good generalization ability, with 78.57% accuracy on a scalable Parkinson's disease dataset.

Paper Structure

This paper contains 11 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The framework of MA2. The framework conducts GRF-based automatic disease detection (ADD) in two sequential steps: i) pre-training (PT); ii) fine-tuning (FT). Concretely, the gait input data is $\mathbf{X}\in\mathbb{R}^{B \times C \times \ell}$, where $B$ is the batch size with an initial value of 1, $C$ (number of channels) is 10, and $\ell$ (sequence length) is 101. In the pre-training phase, $\mathbf{X}$ is used to capture high-level representations and then to reconstruct the output $\mathbf{XR}$, which has the same shape as the input, through the modules (1) Embedding Block $\rightarrow$ (2) ViT Encoder Block $\rightarrow$ (3) Layer Norm \cite{ioffe2015batch} $\rightarrow$ (4) Mask Token $\rightarrow$ (5) ViT Decoder Block. In the fine-tuning phase, the validation data $\mathbf{X}$ passes only through (1) Embedding Block $\rightarrow$ (2) ViT Encoder Block $\rightarrow$ (6) MLP projector, which infers the health status (Pathological/Healthy) of the subject from plantar gait pressure. Figure \ref{pic3} and Figure \ref{pic4} show the operation details of module (1). Similarly, the construction and operation details of (2) ViT Encoder Block and (4) Mask Token are shown in Figure \ref{pic56} and Figure \ref{pic7}, respectively.
  • Figure 2: The process of token embedding and position embedding: the original input data $\mathbf{X}$ is passed through a 1D convolution with a kernel size of 3 $\times$ 1 to obtain tokens $\mathbf{T}_{x}$, which are transposed to obtain tokens $\mathbf{T}_{x}'$; the position embedding $\mathbf{P}_{i}$ is then computed by sine-cosine encoding and added to $\mathbf{T}_{x}'$ to obtain the tokens $\mathbf{TX}$.
  • Figure 3: The masking process for the tokens: first, a fixed mask rate of $m_r=75\%$ is set; the sequence of 101 tokens is then randomly shuffled, the first $101 \times (1-m_{r})=26$ feature vectors are taken (abbreviated to 6 and 8 in the figure), and the original order of these 26 token vectors is restored. These 26 vectors are fed into the encoder for training, while the remaining 75 vectors are masked out.
  • Figure 4: Multi-head attention mechanism: the input $\mathbf{T}_{vx}$ is linearly transformed into $h$ triplet matrices, where $h$ is the number of heads of the feature map; the self-attention score is computed for each of the 12 triplet matrices, and the results are concatenated. Finally, the high-level features $\mathbf{E}$ are obtained.
  • Figure 5: The mask token operation combines the token part $\mathbf{T}_{mx}$ that was previously hidden by the mask generator with the high-level feature $\mathbf{E}_{vx}$ without disturbing the order; a position embedding is applied to each of them before concatenation and fusion, which finally yields $\mathbf{ET}_{x}$.
  • ...and 2 more figures
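
The masking and mask-token steps described in the captions of Figure 3 and Figure 5 can be sketched on token indices alone. This is a minimal illustration, not the authors' implementation: the function names `random_mask` and `restore_sequence` are hypothetical, the masked count is assumed to be the mask rate times the sequence length rounded down (which gives the 75 masked and 26 visible tokens stated in the caption), and a single shared placeholder stands in for the masked tokens. The position embedding applied before fusion in Figure 5 is omitted.

```python
import random


def random_mask(num_tokens=101, mask_rate=0.75, seed=None):
    """Split token indices into visible and masked sets (both in original order).

    Assumption: the number of masked tokens is floor(num_tokens * mask_rate),
    so 101 tokens at a 75% mask rate give 75 masked and 26 visible tokens.
    """
    rng = random.Random(seed)
    num_masked = int(num_tokens * mask_rate)
    num_kept = num_tokens - num_masked

    shuffled = list(range(num_tokens))
    rng.shuffle(shuffled)                  # random shuffle of token positions

    kept = sorted(shuffled[:num_kept])     # restore original order of visible tokens
    masked = sorted(shuffled[num_kept:])   # positions hidden from the encoder
    return kept, masked


def restore_sequence(encoded_visible, kept, num_tokens, mask_token):
    """Put encoded visible tokens back at their original positions.

    Every masked position is filled with `mask_token` (a hypothetical shared
    placeholder), so the decoder sees a full-length, correctly ordered sequence.
    """
    full = [mask_token] * num_tokens
    for idx, vec in zip(kept, encoded_visible):
        full[idx] = vec
    return full


if __name__ == "__main__":
    kept, masked = random_mask(seed=0)
    print(len(kept), len(masked))  # 26 75
```

Sorting the kept indices after shuffling corresponds to "restoring the original order" in Figure 3, and writing the encoder outputs back by index is what keeps the order undisturbed in Figure 5.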