A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry

Chiara Di Vece, Zhehua Mao, Netanell Avisdris, Brian Dromey, Raffaele Napolitano, Dafna Ben Bashat, Francisco Vasconcelos, Danail Stoyanov, Leo Joskowicz, Sophia Bano

TL;DR

To the authors' knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset covering all primary fetal biometry measures. It provides a benchmark for domain adaptation and multi-centre generalisation in fetal biometry, supporting more reliable AI-assisted fetal growth assessment across centres.

Abstract

Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
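As an illustration of what a subject-disjoint split involves, the following is a minimal sketch and not the authors' released splitting code. It assumes each image is described by a hypothetical `(image_id, subject_id)` pair; the function name, the 20% test fraction, and the seed are likewise assumptions chosen for the example. The idea is to assign whole subjects, rather than individual images, to either the training or the test set, so that no subject contributes images to both.

```python
import random
from collections import defaultdict


def subject_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split image ids into train/test so that no subject appears in both.

    records: iterable of (image_id, subject_id) pairs (hypothetical layout).
    Returns (train_image_ids, test_image_ids).
    """
    # Group images by the subject they belong to.
    by_subject = defaultdict(list)
    for image_id, subject_id in records:
        by_subject[subject_id].append(image_id)

    # Shuffle subjects (not images) reproducibly, then cut at the subject level.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])

    train = [img for s in subjects if s not in test_subjects for img in by_subject[s]]
    test = [img for s in subjects if s in test_subjects for img in by_subject[s]]
    return train, test


if __name__ == "__main__":
    # Toy example: 30 images spread over 7 made-up subjects.
    toy = [(f"img{i}", f"subj{i % 7}") for i in range(30)]
    train_ids, test_ids = subject_disjoint_split(toy, test_fraction=0.3, seed=1)
    print(len(train_ids), "train images;", len(test_ids), "test images")
```

Splitting at the subject level avoids leakage from several images of the same fetus landing on both sides of the split, which would otherwise inflate test performance.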

Paper Structure

This paper contains 19 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Variability of anatomical structures across a) FP, b) HC18, and c) UCL datasets. Each row represents one anatomical region. Orientation: polar histogram with log density scale of measurement angle relative to horizontal axis ([0°,360°]); Position: 2D kernel density estimation of centre-point location (pos_x, pos_y $\in$ [0,1]); Size: 1D kernel density estimation of structure size normalised by image area (unitless). Substantial heterogeneity across datasets reflects realistic clinical variability and demonstrates the domain-shift problem.
  • Figure 2: Cross-dataset generalisation heatmaps for fetal biometry. Train$\rightarrow$Test NME (lower is better) for each biometric measurement. Rows denote the training dataset and columns the test dataset. For abdomen and femur, HC18 is omitted where results are unavailable in the cross-dataset evaluation table. Cell values report mean NME. Colour scales are shared within each anatomy group (head: BPD/OFD; abdomen: APAD/TAD; femur: FL).
  • Figure 3: Bland–Altman agreement plots between BiometryNet predictions and ground-truth fetal biometry measurements for a) FP, b) HC18, and c) UCL datasets. Horizontal axis: mean measurement (mm). Vertical axis: percent difference relative to the ground-truth measurement. Solid lines indicate the mean bias and dashed lines indicate the 95% limits of agreement (bias $\pm 1.96$ SD).
  • Figure 4: Absolute biometry error (mm) on the multi-centre (m-c) test set for models trained on the FP, HC18, UCL, and m-c datasets. Boxplots are shown for Head (BPD, OFD), Abdomen (TAD, APAD), and Femur (FL). The y-axis is truncated at 30 mm to improve visualisation of the central distribution.
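
The quantities plotted in the Bland–Altman panels of Figure 3 (mean bias and 95% limits of agreement, bias $\pm 1.96$ SD, on percent differences relative to ground truth) can be sketched as follows. This is an illustrative sketch rather than the paper's evaluation code: the function name, the sign convention (prediction minus ground truth), and the use of the sample standard deviation are assumptions, and the numbers in the usage example are made up.

```python
import statistics


def bland_altman_percent(pred_mm, gt_mm):
    """Return (bias, lower_loa, upper_loa) in percent.

    pred_mm, gt_mm: equal-length sequences of measurements in mm.
    Percent difference is taken relative to the ground truth,
    with the assumed sign convention (prediction - ground truth).
    """
    if len(pred_mm) != len(gt_mm) or len(pred_mm) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    # Per-case percent difference relative to the ground-truth measurement.
    diffs = [100.0 * (p - g) / g for p, g in zip(pred_mm, gt_mm)]

    bias = statistics.mean(diffs)   # solid line in the plot
    sd = statistics.stdev(diffs)    # sample SD (n - 1), an assumption
    return bias, bias - 1.96 * sd, bias + 1.96 * sd  # dashed lines


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    pred = [10.2, 19.8, 30.3, 39.6]
    gt = [10.0, 20.0, 30.0, 40.0]
    bias, lower, upper = bland_altman_percent(pred, gt)
    print(f"bias={bias:.2f}%  limits of agreement=[{lower:.2f}%, {upper:.2f}%]")
```

The horizontal coordinate of each point in such a plot would be the per-case mean of the two measurements, `(p + g) / 2`, in mm.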