The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation

Victor Li, Naveenraj Kamalakannan, Avinash Parnandi, Heidi Schambra, Carlos Fernandez-Granda

TL;DR

This paper evaluates vision-language models (VLMs) for two data-driven stroke rehabilitation tasks: automatic rehabilitation dose quantification and impairment assessment from video. Using a dataset of 80 subjects and two evaluation regimes, the study probes activity identification, dose via functional primitives, and impairment via Fugl-Meyer prompts. It finds that current VLMs lack fine-grained motion understanding needed for precise quantification, with dose estimates near non-visual baselines and unreliable impairment scores. Yet, with prompt optimization, targeted cropping, and post-processing (PRIM-RS), VLMs can robustly classify high-level activities from few frames and achieve partial dose quantification in structured tasks, highlighting both limitations and potential paths for clinical video analysis.

Abstract

Vision-language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We formulate these problems as motion-identification tasks, which can be addressed using VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or finetuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.

Paper Structure

This paper contains 14 sections, 2 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Vision-Language Models (VLMs) for Data-Driven Stroke Rehabilitation. (a) A VLM can function as a video question-answering system. The question, or prompt, and image frames from the video are separately encoded into tokens, which are fed forward through a transformer-based backbone. The backbone's output is then decoded into text. (b) Activity identification: The VLM is provided $8$ frames uniformly sampled from a video ($4$ frames are shown due to space constraints) and a description of nine rehabilitation activities. The VLM output classifies the activity in the video. (c) Dose quantification: The VLM is provided a video segment along with a textual prompt. Its output is then utilized to classify the segment into one of five functional motions or primitives. Rehabilitation dose is quantified by counting the primitives over the whole video. (d) Impairment quantification: The VLM processes video segments of subjects performing mobility exercises from the Fugl-Meyer Assessment (FMA), a standard clinical evaluation of impairment. The input prompt is the corresponding FMA item. The outputs are aggregated to estimate the FMA score of the subject.
  • Figure 2: VLMs accurately classify high-level activities. Shown are confusion matrices for activity identification with Qwen2.5-VL-7B-Instruct ($N=640$ videos). Each cell shows the fraction of samples from a true activity (row) that were predicted as a given activity (column). Left: Independent evaluation, where the VLM prompting is based on a pre-existing description of the activities. Right: Prompt-optimized evaluation, where the VLM prompts are tailored to Qwen2.5-VL-7B-Instruct using videos from $2$ held-out control subjects (see Methods). The optimized prompts achieve higher accuracy, as indicated by the dark diagonal cells. The most frequent mistakes are confusing face wash and RTT with the brushing and combing activities, respectively, possibly due to the small object sizes.
  • Figure 4: With prompt optimization and post-processing, VLMs can quantify rehabilitation dosage for structured activities to some extent. The first five panels show scatterplots comparing the predicted and ground-truth counts for $38$ rehabilitation videos corresponding to two structured tasks (radial table top and shelf). Predictions are obtained using the proposed PRIM-RS method, which provides optimized prompts to the VLM Qwen2.5-VL-32B-Instruct and applies post-processing to its output. The relative counting error (RCE) is primitive-specific, measuring the counting error for each primitive normalized by its total number of ground-truth instances in the $38$-video dataset. The RCE is moderately low at $\approx 25\%$ for reach, reposition, and idle. It is subpar for transport and stabilize, illustrating the difficulty of differentiating these primitives. The bottom-right plot shows the average RCE across subjects with different impairment levels: the performance degrades as the severity level increases.
  • Figure 5: VLMs fail at impairment quantification. Left: The scatterplot shows the ground-truth Fugl-Meyer score assessed by a trained human expert against the Fugl-Meyer score predicted by the VLM Qwen2.5-VL-72B-Instruct. The points would lie along the dotted diagonal line for a model that matches human rating. Instead, for two different prompting methods, the predicted Fugl-Meyer score is nearly constant across severity levels. Right: The bar plots show the average error of the VLM for different subsections of the Fugl-Meyer assessment, again using two prompting methods. For comparison, ONES displays the average error for a model that only predicts $1$. The subsections are, from top to bottom, Arm---Flexor Synergy, Arm---Extensor Synergy, Arm---Movement Combining Synergy, Arm---Movement Out of Synergy, Wrist, Hand, and Coordination/Speed. Both methods perform comparably to the non-informative ONES baseline. Bars show mean $\pm 1$ SEM.
  • Figure 6: Failure modes of VLMs applied to stroke rehabilitation. (a) Large object bias: Left: The model misclassifies a combing activity as an RTT exercise, likely driven by contextual cues from larger objects such as the black mat and wristbands. Right: The model exhibits hand attribution errors, potentially due to the left hand's interaction with a dominant object (water bottle). (b) Overreliance on 2D semantics: Timeline of a $6$ s video with dotted lines marking $0.533$ s segments. Colors indicate grasp state: gray (no grasp), orange (grasp), and black (mixed). The ground truth contains two distinct grasps. When queried, "Is the subject's right hand grasping the pink object? Answer 'Yes' or 'No' directly." for each segment, the VLM misinterprets visual proximity as physical contact and fails to distinguish the two separate grasp events. (c) Hallucination: A patient with severe impairment attempts a shoulder flexion task (ground truth Fugl-Meyer score of $0$). For reference, the diagram on the right depicts successful completion of the task. The model hallucinates movement and incorrectly reports task success.
  • ...and 10 more figures
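
Two quantities in the captions above lend themselves to a short illustration: the uniform sampling of $8$ frames per video (Figure 1b) and the relative counting error, RCE (Figure 4). The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the midpoint-of-chunk sampling rule is one possible reading of "uniformly sampled", and the RCE is assumed to be the sum of absolute per-video count differences for one primitive divided by that primitive's total ground-truth count; the captions do not specify either detail.

```python
from typing import List, Sequence


def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> List[int]:
    """Return num_samples frame indices spread evenly over a video.

    Assumption: the video is split into num_samples equal-length chunks and
    the middle frame of each chunk is taken. The captions only say that the
    frames are uniformly sampled.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def relative_counting_error(
    pred_counts: Sequence[int], true_counts: Sequence[int]
) -> float:
    """Relative counting error for a single primitive across videos.

    Assumed form: sum over videos of |predicted - true| divided by the total
    number of ground-truth instances of the primitive in the dataset.
    """
    if len(pred_counts) != len(true_counts):
        raise ValueError("pred_counts and true_counts must have equal length")
    total_true = sum(true_counts)
    if total_true == 0:
        raise ValueError("the primitive has no ground-truth instances")
    abs_error = sum(abs(p - t) for p, t in zip(pred_counts, true_counts))
    return abs_error / total_true


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not taken from the paper.
    print(uniform_frame_indices(num_frames=80))
    print(relative_counting_error([9, 12, 4], [10, 10, 5]))
```

Normalizing by the dataset-wide ground-truth total, rather than averaging per-video ratios, keeps the metric defined for videos in which a primitive never occurs.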