Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening

Dong Chen; Zizhuang Wei; Jialei Xu; Xinyang Sun; Zonglin He; Meiru An; Huili Peng; Yong Hu; Kenneth MC Cheung

TL;DR

Screening for Adolescent Idiopathic Scoliosis (AIS) suffers from subjectivity, the risks of radiographic imaging, and limited scalability. The authors introduce ScoliGait, a leakage-free gait video benchmark with 1,572 training clips from 550 participants and a 300-clip independent test set. Each clip is annotated with a radiographic Cobb angle ($CA$) and clinical text prompts, and is paired with a clinical-prior-guided kinematic knowledge map. A latent attention pooling mechanism fuses the knowledge map, video, and text encodings into interpretable multimodal representations, achieving 70.0% accuracy and 61.9% F1 on the subject-independent test set; the knowledge map alone outperforms the video modality. The framework offers robust, interpretable, and scalable non-invasive AIS assessment that connects clinical insight with multimodal learning.

Abstract

Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage, where performance is inflated by repeated clips from the same individual, or employ oversimplified models that lack clinical interpretability. To address these limitations, we introduce ScoliGait, a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing. Each clip is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors. We propose a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map for interpretable feature representation, alongside a latent attention pooling mechanism to fuse video, text, and knowledge map modalities. Our approach establishes a new state of the art, showing a significant performance gain on a realistic, subject-independent benchmark. This work provides a robust, interpretable, and clinically grounded foundation for scalable, non-invasive AIS assessment.
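
The data leakage described above arises when clips from one participant appear on both sides of a train/test split. A minimal sketch of a subject-disjoint split follows; it is not the authors' procedure. The function name `split_by_subject`, the `(clip_id, subject_id)` record layout, and the `test_fraction` and `seed` parameters are all assumptions made for illustration.

```python
import random


def split_by_subject(clips, test_fraction=0.2, seed=0):
    """Split clips so that no subject appears in both train and test.

    clips: list of (clip_id, subject_id) pairs (hypothetical layout).
    Returns (train, test), each a list of (clip_id, subject_id) pairs.
    """
    # Split over subjects, not clips: this is what prevents leakage.
    subjects = sorted({subject_id for _, subject_id in clips})
    random.Random(seed).shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])

    train = [(c, s) for c, s in clips if s not in test_subjects]
    test = [(c, s) for c, s in clips if s in test_subjects]
    return train, test


if __name__ == "__main__":
    # Toy data: 10 subjects with 3 clips each (illustrative only).
    toy = [(f"clip_{s}_{i}", f"subj_{s}") for s in range(10) for i in range(3)]
    train, test = split_by_subject(toy)
    overlap = {s for _, s in train} & {s for _, s in test}
    print(len(train), len(test), overlap)
```

Splitting at the clip level instead would let a model recognise an individual's gait rather than the deformity, which is the inflation the abstract warns about.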

Paper Structure (11 sections, 3 figures, 3 tables)

Introduction
Related Works
Methods
Dataset Preparation
Kinematic Knowledge Map
Model Architecture and Latent Attention Fusion
Results
Scoliosis Screening Task
Explainability
Ablation Studies
Conclusions

Figures (3)

Figure 1: ScoliGait system for multi-modal gait analysis from mobile video. Left: temporal alignment of the knowledge map and video. Right: generation of video, knowledge map, and text modalities via pose estimation, showing kinematic alignment and knowledge-guided synthesis.
Figure 2: Proposed three-modal fusion architecture for AIS screening. Inputs from Knowledge Map, Vision, and Text modalities are integrated via a Latent Attention Pooling mechanism (bottom). Remapped attention scores from the Knowledge Map (top) are filtered for salient features to enable clinical interpretation.
Figure 3: Comparison of interpretability methods. Left: The proposed kinematic knowledge map (top) with aligned video (bottom), where axes are time vs. clinical variables. Right: Explicit kinematic features extracted from the map (top) versus conventional pose-based attention heatmaps (bottom). Our method provides explicit clinical insights beyond spatial saliency.
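
The Figure 2 caption describes fusing knowledge map, vision, and text inputs through latent attention pooling, but this page gives no equations for it. The sketch below is therefore only one generic reading, assumed here for illustration: a small set of latent query vectors cross-attends over the concatenated tokens of all modalities, and the attended outputs are averaged. The function `latent_attention_pool`, the absence of projection matrices, and the toy token values are assumptions, not the authors' architecture.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def latent_attention_pool(latents, tokens):
    """Cross-attend each latent query over all tokens, then average.

    latents: K query vectors (learned in a real model; fixed here).
    tokens:  N token vectors from all modalities, each of dimension d.
    Returns (pooled vector of length d, K x N attention weights).
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    # One softmax distribution over tokens per latent query.
    weights = [softmax([dot(q, t) * scale for t in tokens]) for q in latents]
    # Weighted sum of tokens for each latent.
    attended = [
        [sum(w * t[j] for w, t in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    # Mean over latents gives a single fused representation.
    pooled = [sum(vec[j] for vec in attended) / len(attended) for j in range(d)]
    return pooled, weights


if __name__ == "__main__":
    # Hypothetical 2-D tokens, tagged by modality for readability.
    tagged = [
        ("knowledge_map", [0.9, 0.1]),
        ("knowledge_map", [0.8, 0.3]),
        ("video", [0.2, 0.7]),
        ("text", [0.1, 0.4]),
    ]
    tokens = [vec for _, vec in tagged]
    latents = [[1.0, 0.0], [0.0, 1.0]]
    pooled, weights = latent_attention_pool(latents, tokens)

    # Attention mass per modality, averaged over latents.
    share = {}
    for row in weights:
        for (name, _), w in zip(tagged, row):
            share[name] = share.get(name, 0.0) + w / len(weights)
    print(pooled, share)
```

Because the attention weights are explicit, the mass falling on knowledge map tokens can be read off directly, which is the kind of signal the caption describes remapping for clinical interpretation.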
