Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos

Jieyun Bai; Zihao Zhou; Yitong Tang; Jie Gan; Zhuonan Liang; Jianan Fan; Lisa B. Mcguire; Jillian L. Clarke; Weidong Cai; Jacaueline Spurway; Yubo Tang; Shiye Wang; Wenda Shen; Wangwang Yu; Yihao Li; Philippe Zhang; Weili Jiang; Yongjie Li; Salem Muhsin Ali Binqahal Al Nasim; Arsen Abzhanov; Numan Saeed; Mohammad Yaqub; Zunhui Xian; Hongxing Lin; Libin Lan; Jayroop Ramesh; Valentin Bacher; Mark Eid; Hoda Kalabizadeh; Christian Rupprecht; Ana I. L. Namburete; Pak-Hei Yeung; Madeleine K. Wyburd; Nicola K. Dinsdale; Assanali Serikbey; Jiankai Li; Sung-Liang Chen; Zicheng Hu; Nana Liu; Yian Deng; Wei Hu; Cong Tan; Wenfeng Zhang; Mai Tuyet Nhi; Gregor Koehler; Rapheal Stock; Klaus Maier-Hein; Marawan Elbatel; Xiaomeng Li; Saad Slimani; Victor M. Campello; Benard Ohene-Botwe; Isaac Khobo; Yuxin Huang; Zhenyan Han; Hongying Hou; Di Qiu; Zheng Zheng; Gongning Luo; Dong Ni; Yaosheng Lu; Karim Lekadir; Shuo Li

TL;DR

This paper presents a comprehensive overview of the Intrapartum Ultrasound Grand Challenge (IUGC) design, reviews the submissions from eight participating teams, and systematically analyzes the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research.

Abstract

A substantial proportion (45%) of maternal deaths, neonatal deaths, and stillbirths occur during the intrapartum phase, with a particularly high burden in low- and middle-income countries. Intrapartum biometry plays a critical role in monitoring labor progression; however, the routine use of ultrasound in resource-limited settings is hindered by a shortage of trained sonographers. To address this challenge, the Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. The IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry, enabling algorithms to exploit complementary task information for more accurate estimation. Furthermore, the challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals, providing a robust foundation for model training and evaluation. In this study, we present a comprehensive overview of the challenge design, review the submissions from eight participating teams, and analyze their methods from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. In addition, we perform a systematic analysis of the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research. Although encouraging performance has been achieved, our findings indicate that the field remains at an early stage, and further in-depth investigation is required before large-scale clinical deployment. All benchmark solutions and the complete dataset have been publicly released to facilitate reproducible research and promote continued advances in automatic intrapartum ultrasound biometry.

Paper Structure (50 sections, 14 figures, 11 tables)

Introduction
Clinical Background
Challenges
Motivation
Related Work
Related Challenges and Benchmarks
State-of-the-art Intrapartum Biometric Measurements
Materials and Setup
The IUGC Challenge
Organization
Registration and Submission
Participants
Dataset and Evaluation
Dataset
Evaluation Metrics
...and 35 more sections

Figures (14)

Figure 1: Challenges faced by each task in the Intrapartum Ultrasound Grand Challenge (IUGC). Standard plane classification is affected by large intra-class variability caused by imaging artifacts, soft tissue deformation, fetal posture changes, and probe motion, as well as low inter-class separability due to similar echogenic patterns. Automatic segmentation of the fetal head (FH) and pubic symphysis (PS) is further complicated by labor-induced anatomical deformation, the small size of the PS, and ultrasound-specific noise, shadowing, and boundary ambiguity. Biometry estimation (AoP and HSD) requires precise geometric relationships between FH and PS, but multiple valid landmark candidates and fragmented segmentation outputs increase measurement uncertainty and algorithmic complexity.
Figure 2: Overall workflow of clinical image utilization and the Intrapartum Ultrasound Grand Challenge (IUGC). A) Clinical images are acquired from pregnant women during labor via a transperineal ultrasound (US) approach using mid-sagittal scans. B) Manual operations comprise classification of standard and non-standard planes in US videos, segmentation of the pubic symphysis (PS) and fetal head (FH) in standard-plane images, and measurement of two biometric parameters, the angle of progression (AoP) and the head-symphysis distance (HSD), based on landmark annotations on the segmentation results. C) Dataset distribution in the IUGC: the 774 videos (68,106 images) were split into 434 videos (56,571 images) for training, 40 videos (2,870 images) for validation, and 300 videos (8,665 images) for final evaluation. The training set was available to all registered participants. The validation set was used to optimize model performance during training, while the test set was used for the final evaluation of the submitted methods. The top eight algorithms, along with their source code, were ranked according to classification, segmentation, and measurement metrics. D) Precise measurement of biometric parameters provides vital information for evaluating labor progression and predicting the mode of delivery.
Figure 3: Data sources and distribution of the ultrasound video dataset used in the IUGC challenge. A) Representative cases from each contributing hospital within the ultrasound image dataset are depicted. The First Affiliated Hospital of Jinan University (JNU) provided 560 videos with 61,924 images; the Third Affiliated Hospital of Sun Yat-sen University (SYSU) contributed 121 videos with 3,494 images; and the Zhujiang Hospital of Southern Medical University (SMU) supplied 93 videos with 2,688 images. B) Distribution of different source data across training, testing, and validation datasets (number of videos/amount of annotated data) for the classification task. C) Distribution of different source data for the segmentation task. D) Distribution of different source data for the biometric parameter measurement task.
Figure 4: Evaluation results of eight teams' methods in the classification task based on ACC (first column), AUC (second column), F1 Score (third column), and MCC (fourth column). (A) Dot plots and boxplots for visualizing the evaluation metric data separately for each algorithm. (B) Blob plots for visualizing ranking stability based on bootstrap sampling. (C) Significance maps for visualizing the results of significance testing. (D) Line plots for visualizing ranking robustness across different ranking methods. See Section 3.3.2 for details.
Figure 5: Evaluation results of eight teams' methods in the segmentation task based on DSC (first column), ASSD (second column), and HD (third column). (A) Dot plots and boxplots for visualizing the evaluation metric data separately for each algorithm. (B) Blob plots for visualizing ranking stability based on bootstrap sampling. (C) Significance maps for visualizing the results of significance testing. (D) Line plots for visualizing ranking robustness across different ranking methods. See Section 3.3.2 for details.
...and 9 more figures
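
The captions for Figures 4 and 5 name several evaluation metrics without defining them. As a rough illustration only, the sketch below computes ACC, F1, and MCC for binary standard-plane labels and the Dice similarity coefficient (DSC) for a pair of binary masks, using their textbook definitions. The function names, the 0/1 label encoding, and the flat-list mask representation are assumptions made for this example; the challenge's official evaluation code may differ (for instance in how it averages over frames, videos, or the FH and PS classes).

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = standard plane, assumed)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """ACC = correct predictions / all predictions."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

def f1_score(tp, fp, tn, fn):
    """F1 = 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when a marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def dice(mask_a, mask_b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for flat lists of 0/1 pixels."""
    intersection = sum(a * b for a, b in zip(mask_a, mask_b))
    size = sum(mask_a) + sum(mask_b)
    # Convention chosen here: two empty masks count as a perfect match.
    return 2 * intersection / size if size else 1.0

# Toy data, invented for illustration (not taken from the IUGC dataset).
counts = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
print(accuracy(*counts), f1_score(*counts), mcc(*counts))
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))
```

MCC uses all four confusion-matrix cells, so it remains informative when standard and non-standard frames are imbalanced, which may be one reason it is reported alongside ACC and F1.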
