BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment

Erdong Chen, Yuyang Ji, Jacob K. Greenberg, Benjamin Steel, Faraz Arkam, Abigail Lewis, Pranay Singh, Feng Liu

TL;DR

This work proposes BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. The framework achieves state-of-the-art recognition accuracy, and a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding.

Abstract

Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.
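
The subject-disjoint protocol mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the split procedure, so the function name `subject_disjoint_split`, the `(subject_id, clip_id)` sample layout, and the 20% test fraction below are all assumptions chosen for illustration. The only property the sketch aims to show is that every clip from a given subject lands on one side of the split.

```python
import random
from collections import defaultdict


def subject_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (subject_id, clip_id) pairs so no subject is in both sets.

    Hypothetical helper: the sample layout and the default fraction are
    assumptions, not details taken from the paper.
    """
    # Group clips by the subject they were recorded from.
    clips_by_subject = defaultdict(list)
    for subject_id, clip_id in samples:
        clips_by_subject[subject_id].append(clip_id)

    # Shuffle subjects (not clips) with a fixed seed for reproducibility.
    subjects = sorted(clips_by_subject)
    random.Random(seed).shuffle(subjects)

    # Reserve a fraction of subjects, at least one, for testing.
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = subjects[:n_test]
    train_subjects = subjects[n_test:]

    train = [(s, c) for s in train_subjects for c in clips_by_subject[s]]
    test = [(s, c) for s in test_subjects for c in clips_by_subject[s]]
    return train, test


if __name__ == "__main__":
    # Made-up identifiers: 10 subjects with 3 clips each.
    demo = [(f"s{i}", f"s{i}_clip{j}") for i in range(10) for j in range(3)]
    train, test = subject_disjoint_split(demo)
    overlap = {s for s, _ in train} & {s for s, _ in test}
    print(len(train), len(test), overlap)
```

Splitting at the subject level rather than the clip level is what prevents the leakage the abstract warns about: a clip-level shuffle could place two recordings of the same person in both training and test sets.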

Paper Structure

This paper contains 29 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of BioGait-VLM. The framework integrates three streams: (Top) Temporal Evidence Distillation (TED) for rhythmic dynamics; (Middle) Visual Context; and (Bottom) Biomechanical Tokenization, projecting 3D kinematics into semantic tokens. Fused within a frozen LVLM, these modalities enable state-of-the-art classification and evidence-grounded clinical reporting.
  • Figure 2: DCM Dataset Acquisition Setup. (a) The recording environment at the outpatient clinic. The scene includes realistic background clutter (e.g., treadmills, medical equipment) rather than a sterile lab background, providing a rigorous testbed for evaluating model robustness against visual shortcuts. (b) Top-down schematic of the standardized capture protocol. A tripod-mounted RGB camera is positioned to capture the sagittal (side) view of the patient traversing the instrumented gait mat, ensuring consistent biomechanical visibility.
  • Figure 3: Qualitative Visualization of Interpretable Reasoning. Our BioGait-VLM generates detailed clinical rationales anchored in quantitative evidence. (Top) Myopathic: The model identifies "muscle weakness" via specific metrics, citing "high steps per minute (173.68)" and "increased double support time." (Middle) Abnormal: It detects potential coordination issues by calculating precise asymmetry: "left step time significantly shorter (0.5s vs. 0.81s)" and "imbalance in limb propulsion." (Bottom) Abnormal: It distinguishes a different irregularity by counting specific events, noting "more toe-offs on the right side (4) compared to the left (3)" and quantifying swing percentage. These examples demonstrate how Biomechanical Tokenization enables granular, evidence-based differentiation even within broad diagnostic categories.
  • Figure 4: Interface for Blinded Expert Evaluation. To ensure unbiased assessment, we developed a custom web-based annotation tool. Clinicians view the patient gait video alongside three anonymized and randomized model-generated reports (labeled Diagnosis A, B, C). Experts rate each rationale independently on four clinical dimensions using Likert scales before selecting the superior model, ensuring that rankings are based solely on clinical quality rather than model identity.
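
The rationales quoted in Figure 3 cite quantities such as steps per minute and left versus right step time. The captions do not say how BioGait-VLM derives these values, so the following is only a sketch of one conventional way to compute them from heel-strike timestamps. The function name `gait_metrics`, the input format, and the step-time definition (time from one foot's heel strike to the next heel strike of the opposite foot) are assumptions.

```python
def gait_metrics(left_strikes, right_strikes):
    """Compute cadence and per-side mean step time from heel-strike times.

    Inputs are heel-strike timestamps in seconds for each foot. This is a
    hypothetical helper, not the paper's biomechanical tokenization.
    """
    # Merge both feet into one time-ordered event stream.
    events = sorted(
        [(t, "L") for t in left_strikes] + [(t, "R") for t in right_strikes]
    )
    if len(events) < 2:
        raise ValueError("need at least two heel strikes")

    # A step for side X ends at X's heel strike and starts at the
    # preceding heel strike of the opposite foot.
    step_times = {"L": [], "R": []}
    for (t_prev, side_prev), (t_cur, side_cur) in zip(events, events[1:]):
        if side_prev != side_cur:
            step_times[side_cur].append(t_cur - t_prev)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    n_steps = len(step_times["L"]) + len(step_times["R"])
    duration = events[-1][0] - events[0][0]
    cadence = 60.0 * n_steps / duration if duration > 0 else float("nan")

    left_mean = mean(step_times["L"])
    right_mean = mean(step_times["R"])
    return {
        "cadence_spm": cadence,              # steps per minute
        "left_step_time_s": left_mean,
        "right_step_time_s": right_mean,
        "step_time_diff_s": abs(left_mean - right_mean),
    }


if __name__ == "__main__":
    # Made-up timestamps, not data from the paper.
    print(gait_metrics([0.0, 1.0, 2.0], [0.6, 1.6]))
```

A large gap between the two mean step times is the kind of asymmetry the Figure 3 rationale describes in words. Event counts such as toe-offs per side would need a separate toe-off detector, which this sketch does not attempt.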