Large language models surpass domain-specific architectures for antepartum electronic fetal monitoring analysis

Sheng Wong, Ravi Shankar, Beth Albert, Gabriel Davis Jones

TL;DR

This work addresses the need for robust automated antepartum cardiotocography (CTG) analysis by benchmarking a broad spectrum of architectures, from domain-specific deep learning (DL) models to time-series foundation models and fine-tuned large language models (LLMs), on a unified CTG classification task using over 2,500 real-world 20-minute recordings. The study demonstrates that fine-tuned LLMs achieve state-of-the-art discrimination (peaking at an AUC of 0.852 with Llama-3B text input) but require substantial computational resources, while convolutional and CL-embedding variants offer greater robustness under data scarcity or missing uterine activity (UA) signals. Temporal dependency analyses reveal that some models rely on long-range sequence information, whereas others leverage distributional signal properties, which influences their performance under perturbations. The findings provide practical guidance for deploying CTG analysis systems, highlighting a trade-off between predictive performance and computational efficiency, and advocate prospective multicenter validation and use as decision support rather than as a replacement for clinicians.
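The temporal-shuffling idea mentioned above can be illustrated with a minimal sketch. The paper's exact perturbation procedure is not given in this summary, so the function name `shuffle_in_time`, the block-wise scheme, and the `block_len` parameter are all assumptions made for illustration. The point it shows is that shuffling destroys temporal order while leaving the distribution of sample values unchanged, so a model that relies on distributional properties should be less affected than one that relies on long-range sequence structure.

```python
import numpy as np

def shuffle_in_time(signal, block_len, seed=0):
    """Permute fixed-length blocks of a 1-D signal (hypothetical helper).

    The set of sample values is preserved; only their order changes.
    block_len=1 shuffles individual samples; larger blocks keep
    short-range structure and break only longer-range dependencies.
    """
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    n_blocks = len(signal) // block_len
    # Samples that do not fill a whole block stay in place at the end.
    head = signal[: n_blocks * block_len].reshape(n_blocks, block_len)
    tail = signal[n_blocks * block_len:]
    order = rng.permutation(n_blocks)
    return np.concatenate([head[order].ravel(), tail])

# Synthetic stand-in for a trace (not real CTG data): a slow oscillation plus noise.
rng = np.random.default_rng(42)
t = np.arange(4800)
trace = 140 + 10 * np.sin(2 * np.pi * t / 1200) + rng.normal(0, 2, t.size)

shuffled = shuffle_in_time(trace, block_len=240)

# The value distribution is identical before and after shuffling.
same_values = np.array_equal(np.sort(trace), np.sort(shuffled))
```

Comparing a model's AUC on the original and shuffled inputs then gives a rough indication of how much it depends on temporal ordering.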

Abstract

Foundation models (FMs) and large language models (LLMs) have demonstrated promising generalization across diverse domains for time-series analysis, yet their potential for electronic fetal monitoring (EFM) and cardiotocography (CTG) analysis remains underexplored. Most existing CTG studies rely on domain-specific models and lack systematic comparisons with modern foundation or language models, limiting our understanding of whether these models can outperform specialized systems in fetal health assessment. In this study, we present the first comprehensive benchmark of state-of-the-art architectures for automated antepartum CTG classification. Over 2,500 20-minute recordings were used to evaluate more than 15 models spanning domain-specific, time-series, foundation, and language-model categories under a unified framework. Fine-tuned LLMs consistently outperformed both foundation and domain-specific models across data-availability scenarios, except when uterine-activity signals were absent, in which case domain-specific models showed greater robustness. These performance gains, however, required substantially higher computational resources. Our results highlight that while fine-tuned LLMs achieved state-of-the-art performance for CTG classification, practical deployment must balance performance with computational efficiency.

Paper Structure

This paper contains 16 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Average AUC performance for all models
  • Figure 2: Average accuracy for all models
  • Figure 3: Average AUC performance across all models trained with limited training data
  • Figure 4: Average AUC performance across all models without UA
  • Figure 5: Average AUC performance across all models with temporal shuffling
  • ...and 1 more figures