MvBody: Multi-View-Based Hybrid Transformer Using Optical 3D Body Scan for Explainable Cesarean Section Prediction

Ruting Cheng, Boyuan Feng, Yijiang Zheng, Chuhui Qiu, Aizierjiang Aiersilan, Joaquin A. Calderon, Wentao Zhao, Qing Pan, James K. Hahn

TL;DR

This work tackles cesarean section (CS) risk prediction in resource-limited settings by combining self-reported medical data and 3D optical body scans in a novel multi-view Transformer, MvBody. The model uses a two-branch architecture trained with a metric learning objective, the Soft Margin Triplet-Center Loss, and explains its predictions with Integrated Gradients. On the independent test set it achieves an AUC-ROC of $0.724$ and an accuracy of $84.62\%$. Pre-pregnancy weight, maternal age, obstetric history, prior CS, and body shape around the head and shoulders are the most influential inputs for CS risk; ablations confirm the importance of the head/shoulder regions and of early fusion. Overall, MvBody offers a potentially scalable, explainable approach to prenatal risk screening in home or community settings, with future work aimed at smartphone-based 3D scanning and larger-scale validation.
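To make the metric learning term concrete, the sketch below shows one common way a soft-margin triplet-center loss can be written: each embedding is pulled toward its own class center and pushed away from the nearest other center, with the hinge replaced by a softplus. This summary does not give the paper's exact formulation, so the squared Euclidean distance, the averaging over samples, and all names (`soft_margin_triplet_center_loss`, `centers`, and so on) are assumptions made for illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def soft_margin_triplet_center_loss(features, labels, centers):
    """Hypothetical soft-margin triplet-center loss (illustrative only).

    features: list of embedding vectors, one per sample
    labels:   list of integer class indices into `centers`
    centers:  list of class-center vectors (learned in a real model)
    """
    total = 0.0
    for feat, label in zip(features, labels):
        # Distance to the sample's own class center (should be small).
        d_pos = squared_distance(feat, centers[label])
        # Distance to the nearest center of any other class (should be large).
        d_neg = min(
            squared_distance(feat, c)
            for j, c in enumerate(centers)
            if j != label
        )
        # Softplus replaces max(d_pos + margin - d_neg, 0), so no margin is tuned.
        total += softplus(d_pos - d_neg)
    return total / len(features)


if __name__ == "__main__":
    centers = [[0.0, 0.0], [10.0, 10.0]]
    embeddings = [[0.5, -0.2], [9.0, 11.0]]
    print(soft_margin_triplet_center_loss(embeddings, [0, 1], centers))
```

With well-separated embeddings the loss approaches zero; with labels that disagree with the nearest center it grows roughly linearly in the distance gap, which is the behavior a soft margin is meant to provide.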

Abstract

Accurately assessing the risk of cesarean section (CS) delivery is critical, especially in settings with limited medical resources, where access to healthcare is often restricted. Early and reliable risk prediction allows better-informed prenatal care decisions and can improve maternal and neonatal outcomes. However, most existing predictive models are tailored for in-hospital use during labor and rely on parameters that are often unavailable in resource-limited or home-based settings. In this study, we conduct a pilot investigation to examine the feasibility of using 3D body shape for CS risk assessment for future applications with more affordable general devices. We propose a novel multi-view-based Transformer network, MvBody, which predicts CS risk using only self-reported medical data and 3D optical body scans obtained between the 31st and 38th weeks of gestation. To enhance training efficiency and model generalizability in data-scarce environments, we incorporate a metric learning loss into the network. Compared to widely used machine learning models and the latest advanced 3D analysis methods, our method demonstrates superior performance, achieving an accuracy of 84.62% and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.724 on the independent test set. To improve transparency and trust in the model's predictions, we apply the Integrated Gradients algorithm to provide theoretically grounded explanations of the model's decision-making process. Our results indicate that pre-pregnancy weight, maternal age, obstetric history, previous CS history, and body shape, particularly around the head and shoulders, are key contributors to CS risk prediction.
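As a rough illustration of how Integrated Gradients assigns attributions, the sketch below applies the standard definition, $\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i}\,d\alpha$, to a toy logistic model with an analytic gradient. The toy model, its weights, the all-zero baseline, and the midpoint-rule approximation are assumptions chosen to keep the example self-contained; they are not MvBody or the authors' attribution setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy stand-in for a risk model: logistic regression output."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to its inputs."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate Integrated Gradients along the straight path baseline -> x."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule for the path integral
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = model_grad(point, w, b)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    # Scale the path-averaged gradient by the input difference.
    return [(xi - b0) * a for xi, b0, a in zip(x, baseline, avg_grad)]


if __name__ == "__main__":
    w, b = [0.8, -0.5, 0.0], 0.1       # hypothetical weights and bias
    x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
    attributions = integrated_gradients(x, baseline, w, b)
    print(attributions)
    # Completeness: attributions sum to F(x) - F(baseline), up to numerical error.
    print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

The completeness check at the end is the property that makes the explanations "theoretically grounded": the attributions account for the full change in model output between the baseline and the input.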

Paper Structure

This paper contains 12 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Architecture of the proposed hybrid multi-view neural network. The upper stream illustrates the processing of numeric medical features, while the lower stream represents the processing of 3D body features with two-stage fusions of intermediate medical features. The detailed structure of the Transformer Block is shown in the upper-right box, and key vectors are labeled with distinct colors at the bottom for clarity.
  • Figure 2: Distributions of the most significant medical parameters' attributions.
  • Figure 3: Distribution of different projection tokens' attributions.
  • Figure 4: Pixel-level attributions of input projections. Grayscale intensity represents the magnitude of attribution at each pixel.