Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis

Haiqing Li, Yuzhi Guo, Feng Jiang, Thao M. Dang, Hehuan Ma, Qifeng Zhou, Jean Gao, Junzhou Huang

TL;DR

This work tackles non-invasive scoliosis screening by analyzing gait videos, which avoids the radiation risks of X-ray imaging, and introduces TG-MILNet, an integrated Text-Guided Multi-Instance Learning framework. The method combines DTW-based gait phase clustering, Inter-Bag Temporal Attention for cross-phase fusion, and a Boundary-Aware Model augmented with textual guidance from domain experts and GPT-4o to improve interpretability and sensitivity to borderline cases. On the large-scale Scoliosis1K dataset, TG-MILNet achieves state-of-the-art accuracy ($89.9\%$) with high sensitivity ($99.5\%$) and remains robust under severe class imbalance, particularly for the neutral/borderline class. The approach offers a scalable, non-invasive screening tool for early scoliosis detection, with potential applicability to other imbalanced medical tasks.

Abstract

Early-stage scoliosis is often difficult to detect, particularly in adolescents, for whom delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLMs) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at https://github.com/lhqqq/TG-MILNet
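
For readers unfamiliar with Dynamic Time Warping, the following is a minimal sketch of the textbook DTW distance between two 1-D sequences. It is not taken from the TG-MILNet code base: the function name `dtw_distance`, the absolute-difference cost, and the scalar per-frame features are all illustrative assumptions. It only shows why DTW tolerates the kind of temporal misalignment the abstract mentions.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook DTW distance between two 1-D sequences (illustrative only)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: absolute difference of the two samples.
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same pattern played at a different speed aligns at zero cost.
slow = [0.0, 1.0, 1.0, 2.0]
fast = [0.0, 1.0, 2.0]
same_shape = dtw_distance(fast, slow)
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

A clustering step could use such pairwise distances to group frames or sub-sequences into phases; how the paper actually does this is described in its method section, not here.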

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustrations of Cobb angle calculation for scoliosis assessment: (a) X-ray measurement, (b) schematic representation, and (c) scoliosis progression.
  • Figure 2: The TG-MILNet flowchart includes: (1) DTW-based clustering that segments video frames into temporal bags, enabling phase-specific feature extraction, (2) IBTA mechanism that prioritizes informative gait phases and filters irrelevant variations, and (3) a dual-branch classification framework addressing data imbalance and borderline cases. Expert and GPT-4-based textual guidance enhances interpretability by emphasizing scoliosis-related gait patterns.
  • Figure 3: Confusion matrices of ScoNet-MT and TG-MILNet under different class-imbalance settings.
  • Figure 4: t-SNE Visualization on an Imbalanced Dataset.
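
The second stage in the Figure 2 caption, which prioritizes informative gait phases, belongs to the general family of attention-based pooling over bags. The sketch below shows that generic idea only and should not be read as the paper's IBTA: the names `softmax` and `attention_pool`, the dot-product scoring, and the fixed scoring vector `w` (which would normally be learned) are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    bags: Sequence[Sequence[float]], w: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Weight each bag embedding by a softmax over dot-product scores.

    bags: one embedding vector per temporal bag (hypothetical features).
    w:    scoring vector, assumed given here; in practice it is learned.
    Returns the pooled embedding and the per-bag attention weights.
    """
    scores = [sum(x * wi for x, wi in zip(bag, w)) for bag in bags]
    weights = softmax(scores)
    dim = len(bags[0])
    pooled = [
        sum(weights[k] * bags[k][d] for k in range(len(bags)))
        for d in range(dim)
    ]
    return pooled, weights


# With a zero scoring vector every bag gets equal weight (plain averaging).
pooled_uniform, weights_uniform = attention_pool(
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
)
# A scoring vector aligned with the first bag shifts weight toward it.
pooled_biased, weights_biased = attention_pool(
    [[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0]
)
```

The returned weights are what make this style of pooling inspectable: a bag with a larger weight contributed more to the pooled representation.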