Automatic Temporal Segmentation for Post-Stroke Rehabilitation: A Keypoint Detection and Temporal Segmentation Approach for Small Datasets

Jisoo Lee, Tamim Ahmed, Thanassis Rikakis, Pavan Turaga

TL;DR

This paper tackles objective post-stroke rehabilitation assessment in aging populations, where real-world data are scarce. It proposes a three-phase framework (keypoint detection, refinement of the detected outputs, and temporal segmentation) that combines 2D hand-object keypoint detection with 1D temporal segmentation using Transformer-based encoders, tailored for small datasets and multiple camera views. Using the ASAR ARAT dataset, it demonstrates data refinement techniques to handle occlusion and missing data, and shows that deeper Transformer-based models, especially when combined with LSTM, improve frame-wise segmentation accuracy across views. The work highlights the clinical relevance of biomechanics-informed design and data-quality improvements, with potential to extend to other domains requiring precise action segmentation under small-data constraints.
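The refinement step for occluded or missing detections can be illustrated with a small sketch. This summary does not say which refinement technique the paper uses, so linear interpolation is an assumption here, and the function name `fill_gaps` and the sample values are hypothetical.

```python
from typing import List, Optional


def fill_gaps(track: List[Optional[float]]) -> List[Optional[float]]:
    """Fill missing (None) samples in one keypoint coordinate track.

    Assumption: linear interpolation between observed samples; leading and
    trailing gaps hold the nearest observed value. A track with no observed
    samples is returned unchanged.
    """
    known = [i for i, v in enumerate(track) if v is not None]
    if not known:
        return list(track)
    filled = list(track)
    # Hold the first/last observed value across leading/trailing gaps.
    for i in range(known[0]):
        filled[i] = track[known[0]]
    for i in range(known[-1] + 1, len(track)):
        filled[i] = track[known[-1]]
    # Interpolate linearly between consecutive observed samples.
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            filled[i] = track[a] + (track[b] - track[a]) * (i - a) / (b - a)
    return filled


# x-coordinate of one hand landmark; None marks frames where detection failed.
x = [None, 10.0, None, None, 16.0, None]
print(fill_gaps(x))  # [10.0, 10.0, 12.0, 14.0, 16.0, 16.0]
```

In practice each coordinate of each keypoint would be filled independently before the tracks are stacked into the 1D time series passed to the segmentation model.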

Abstract

Rehabilitation is essential and critical for post-stroke patients, addressing both physical and cognitive aspects. Stroke predominantly affects older adults, with 75% of cases occurring in individuals aged 65 and older, underscoring the urgent need for tailored rehabilitation strategies in aging populations. Despite the critical role therapists play in evaluating rehabilitation progress and ensuring the effectiveness of treatment, current assessment methods can often be subjective, inconsistent, and time-consuming, leading to delays in adjusting therapy protocols. This study aims to address these challenges by providing a solution for consistent and timely analysis. Specifically, we perform temporal segmentation of video recordings to capture detailed activities during stroke patients' rehabilitation. The main application scenario motivating this study is the clinical assessment of daily tabletop object interactions, which are crucial for post-stroke physical rehabilitation. To achieve this, we present a framework that leverages the biomechanics of movement during therapy sessions. Our solution divides the process into two main tasks: 2D keypoint detection to track patients' physical movements, and 1D time-series temporal segmentation to analyze these movements over time. This dual approach enables automated labeling with only a limited set of real-world data, addressing the challenges of variability in patient movements and limited dataset availability. By tackling these issues, our method shows strong potential for practical deployment in physical therapy settings, enhancing the speed and accuracy of rehabilitation assessments.
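To make the idea of automated labeling concrete, the sketch below shows one way per-frame predictions from a temporal segmentation model could be collapsed into timestamped segments. It is an illustration only: the label names, the frame rate, and the helper `frames_to_segments` are hypothetical and do not come from the paper.

```python
from itertools import groupby
from typing import List, Tuple


def frames_to_segments(labels: List[str], fps: float) -> List[Tuple[str, float, float]]:
    """Collapse per-frame labels into (label, start_sec, end_sec) segments."""
    segments = []
    start = 0  # index of the first frame of the current run
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((label, start / fps, (start + length) / fps))
        start += length
    return segments


# Hypothetical frame-wise predictions for a one-second clip at 10 frames/s.
labels = ["idle"] * 3 + ["reach"] * 2 + ["grasp"] * 5
for segment in frames_to_segments(labels, fps=10.0):
    print(segment)
# ('idle', 0.0, 0.3)
# ('reach', 0.3, 0.5)
# ('grasp', 0.5, 1.0)
```

Each segment boundary is a timestamp a therapist could review, which is the kind of output the frame-wise predictions are meant to support.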

Paper Structure

This paper contains 13 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the three-phase process for temporal keypoint-based video action segmentation using a transformer model. This includes the detection of keypoints (such as object centers and hand landmarks), the refinement of the detected outcomes, and the use of the refined time-series data to train a transformer-based model for accurate timestamp prediction.
  • Figure 2: The results of object and hand landmark detection on a sample frame extracted from a video. The left image shows the input frame, the middle image displays the object detection results obtained by TridentNet, and the right image visualizes the 21 hand keypoints extracted using MediaPipe.
  • Figure 3: Camera settings used for collecting video data. Three cameras are positioned orthogonal to each other. Data from the left and right cameras is classified based on the patient's hand usage: palm-facing data is contralateral while back-of-the-hand-facing data is ipsilateral.
  • Figure 4: The attention map for contralateral view data. The 10 images on the left show the attention scores of a single head across each of the 10 encoder layers of the Trans10 model. These images reveal how the attention scores evolve with increasing layer depth. The right image presents the average attention scores across the 8 heads of the multi-head attention module in the final encoder layer. This visualization highlights the aggregated attention patterns after processing through all encoder layers.
  • Figure 5: Qualitative results of temporal segmentation. The top row of plots displays the actual segment labels of the data, while the bottom row presents the predicted results. The leftmost plot is a sample from the top view data, the middle plot shows an example from the contralateral view data, and the rightmost plot represents the ipsilateral view data. These plots provide a visual comparison between the ground truth segmentations and the model's predictions across different perspectives.
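The head-averaged attention map described for Figure 4 can be sketched as an element-wise mean over per-head attention matrices. The example below is a toy under stated assumptions: it uses two 2x2 heads instead of the 8 heads of the Trans10 model, and the function name `average_heads` is hypothetical rather than taken from the paper's code.

```python
from typing import List

Matrix = List[List[float]]


def average_heads(heads: List[Matrix]) -> Matrix:
    """Element-wise mean of per-head attention maps (all must share one shape)."""
    if not heads:
        raise ValueError("need at least one head")
    n_rows, n_cols = len(heads[0]), len(heads[0][0])
    return [
        [sum(h[r][c] for h in heads) / len(heads) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Two toy attention heads; each row sums to 1, as softmax attention rows do.
heads = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]
print(average_heads(heads))  # [[0.5, 0.5], [0.5, 0.5]]
```

Because each head's rows sum to 1, the averaged rows also sum to 1, so the aggregated map can be read on the same scale as a single head's map.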