Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening

Md Saiful Islam; Ekram Hossain; Abdelrahman Abdelkader; Tariq Adnan; Fazla Rabbi Mashrur; Sooyong Park; Praveen Kumar; Qasim Sudais; Natalia Chunga; Nami Shah; Jan Freyberg; Christopher Kanan; Ruth Schneider; Ehsan Hoque

TL;DR

Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring.

Abstract

Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs, including VideoPrism, V-JEPA, ViViT, and VideoMAE, to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5
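The abstract reports AUC, sensitivity, and specificity, and notes that sensitivity lags behind specificity. As a minimal sketch of how these three metrics relate, the Python below computes them from classifier scores. The function names, labels, and scores are illustrative assumptions, not taken from the paper's code or data.

```python
from typing import List, Tuple


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def sens_spec(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called PD."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up toy data (1 = PD, 0 = non-PD); NOT results from the paper.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.3, 0.6, 0.35, 0.2, 0.2, 0.1, 0.05]

auc = roc_auc(labels, scores)
default_sens, default_spec = sens_spec(labels, scores, threshold=0.5)
lowered_sens, lowered_spec = sens_spec(labels, scores, threshold=0.3)

print(f"AUC = {auc:.3f}")
print(f"threshold 0.5: sensitivity = {default_sens:.3f}, specificity = {default_spec:.3f}")
print(f"threshold 0.3: sensitivity = {lowered_sens:.3f}, specificity = {lowered_spec:.3f}")
```

On this toy data, lowering the decision threshold raises sensitivity at the cost of specificity while the AUC stays the same, which is one reason a per-task choice of threshold can matter for a screening tool.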

Paper Structure (12 sections, 3 figures, 1 table)

This paper contains 12 sections, 3 figures, 1 table.

Introduction
Methods
Dataset and Ethics
Standardized Clinical Tasks and Domain Classification
Video Foundation Model Architectures
Experimental Protocol
Results and Discussions
Task-wise Saliency Patterns and Clinical Implications
Architecture-Specific Strengths Across Clinical Domains
Multi-View and Oversampling Ablations
Limitations
Conclusion

Figures (3)

Figure 1: Overview of the benchmarking framework for Parkinson's disease (PD) screening using video foundation models (VFMs). (A) Video Dataset: Our study utilizes a large-scale dataset of 32,847 videos from 1,888 participants (727 with PD) performing 16 standardized clinical tasks. (B) Evaluation Pipeline: Raw video data is processed through a suite of state-of-the-art frozen VFMs to extract latent representations. These embeddings are evaluated using a task-specific linear classification head to differentiate between PD and non-PD participants. (C) Task-Model Saliency: Systematic evaluation to investigate architecture-specific strengths.
Figure 2: Dataset characteristics. (A) The age distribution of participants categorized by PD status; (B) the distribution of clinical PD stage (Hoehn & Yahr scale) when available; and (C) the number of videos collected for each of the 16 standardized tasks.
Figure 3: Comparative performance of VFMs across broad clinical domains. This plot organizes results by broad clinical category along the x-axis, with color-coded boxes showing the performance distribution of each VFM within that domain. Individual data points, distinguished by unique marker shapes (defined in the bottom legend), represent the specific AUC achieved by a model on a distinct clinical task. High-level analysis reveals that the "Upper-Limb Motor Kinematics" domain generally yields the highest predictive performance. At the task level, distinct model specializations emerge: VideoPrism shows superior consistency across "Facial Expressivity," "Visual Speech Kinematics," and "Oculo-Cervical & Cognitive Control". In contrast, V-JEPA and its variant dominate in "Upper-Limb Motor Kinematics," achieving the highest overall scores on tasks like flip-palm.
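The evaluation pipeline in Figure 1(B), frozen embeddings followed by a task-specific linear classification head, can be illustrated with a small linear-probe sketch. Everything here is an assumption for illustration: `fake_frozen_embedding` is a hypothetical stand-in that draws synthetic vectors instead of running a real VFM, the embedding size is tiny, and the head is a plain logistic regression trained by gradient descent, which may differ from the authors' setup.

```python
import math
import random
from typing import List, Tuple

EMBED_DIM = 8  # hypothetical; real VFM embeddings are much larger


def fake_frozen_embedding(label: int, rng: random.Random) -> List[float]:
    """Stand-in for a frozen VFM forward pass (assumption, not a real model).

    Returns a Gaussian vector whose mean depends on the label, so the
    two classes are roughly linearly separable.
    """
    shift = 0.75 if label == 1 else -0.75
    return [rng.gauss(shift, 1.0) for _ in range(EMBED_DIM)]


def make_split(n: int, rng: random.Random) -> Tuple[List[List[float]], List[int]]:
    """Build n synthetic (embedding, label) pairs with alternating labels."""
    labels = [i % 2 for i in range(n)]
    return [fake_frozen_embedding(y, rng) for y in labels], labels


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    """Probability of the positive class under the linear head."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_head(
    X: List[List[float]], y: List[int], epochs: int = 200, lr: float = 0.1
) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent.

    Only the head's weights are updated; the embeddings stay fixed,
    which is what "frozen" means in this setting.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, target in zip(X, y):
            err = predict_proba(w, b, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)  # fixed seed so the sketch is reproducible
X_train, y_train = make_split(200, rng)
X_test, y_test = make_split(100, rng)

w, b = train_linear_head(X_train, y_train)
accuracy = sum(
    (predict_proba(w, b, x) >= 0.5) == bool(t) for x, t in zip(X_test, y_test)
) / len(y_test)
print(f"held-out accuracy on synthetic embeddings: {accuracy:.2f}")
```

In a real benchmark of this kind, the stand-in function would be replaced by embeddings extracted once per video from each frozen model, and one head would be trained per task so that models can be compared on equal footing.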
