
Classification of Radiological Text in Small and Imbalanced Datasets in a Non-English Language

Vincent Beliveau, Helene Kaas, Martin Prener, Claes N. Ladefoged, Desmond Elliott, Gitte M. Knudsen, Lars H. Pinborg, Melanie Ganz

TL;DR

This study addresses the challenge of classifying radiology text in small, imbalanced datasets in a non-English language (Danish) by comparing BERT-like transformers, SetFit, and prompted LLMs on epilepsy-related MRI reports. Using 16,899 Danish reports with labels for focal cortical dysplasia, mesial temporal sclerosis, and hippocampal abnormalities, the authors show that domain-pretrained BERT-like models consistently outperform SetFit and LLMs, with DanskBERT delivering the strongest results for two labels and XLM-RoBERTa performing best for the third. Pretraining on a large radiology corpus yields performance gains, but none of the models reaches fully automated, expert-level accuracy, suggesting a role for these models in data filtering rather than end-to-end labeling. The work highlights the importance of language- and domain-specific resources for non-English radiology NLP and notes practical considerations, including translation quality and data privacy, for deploying such models in real-world clinical settings.

Abstract

Natural language processing (NLP) in the medical domain can underperform in real-world applications involving small datasets in a non-English language with few labeled samples and imbalanced classes. There is as yet no consensus on how to approach this problem. We evaluated a set of NLP models, including BERT-like transformers, few-shot learning with sentence transformers (SetFit), and prompted large language models (LLMs), using three datasets of radiology reports on magnetic resonance images of epilepsy patients in Danish, a low-resource language. Our results indicate that BERT-like models pretrained in the target domain of radiology reports currently offer the best performance for this scenario. Notably, the SetFit and LLM models underperformed compared to BERT-like models, with the LLMs performing worst. Importantly, none of the models investigated was sufficiently accurate to allow for text classification without any supervision. However, they show potential for data filtering, which could reduce the amount of manual labeling required.
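
To make the data-filtering idea concrete, the following is a minimal sketch and not the authors' procedure. It assumes a classifier that outputs a score per report and a hypothetical threshold of 0.25; reports scoring below the threshold are set aside, and only the rest go to manual labeling. The scores and labels are made-up toy values, and `filter_for_review` is an illustrative helper name.

```python
def filter_for_review(scores, threshold):
    """Return the indices of reports whose score meets the threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]


# Made-up classifier scores and ground-truth labels for eight toy reports.
scores = [0.02, 0.10, 0.85, 0.40, 0.01, 0.95, 0.05, 0.30]
is_positive = [False, False, True, True, False, True, False, False]

# Hypothetical threshold; in practice it would be tuned on held-out data.
selected = filter_for_review(scores, threshold=0.25)

# Share of reports that still need manual labeling after filtering.
review_fraction = len(selected) / len(scores)

# Share of truly positive reports that survive the filter.
kept_positives = sum(1 for i in selected if is_positive[i])
retained_recall = kept_positives / sum(is_positive)
```

The trade-off is between `review_fraction` (manual workload) and `retained_recall` (positives not lost to the filter); a lower threshold keeps more positives at the cost of more manual review.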

Paper Structure (16 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: An example of a short radiology report describing a patient with focal cortical dysplasia (FCD). Dates were anonymized for presentation purposes.
  • Figure 2: Overview of the data extraction, labeling and preprocessing. FCD: focal cortical dysplasia, MTS: mesial temporal sclerosis, HA: hippocampal abnormality, PACS: picture archiving and communication system.
  • Figure 3: Confusion matrices of selected classifiers on the FCD test dataset. Recall is TP/(TP+FN). A: No FCD, B: Potential FCD, C: Highly Probable FCD, D: FCD
  • Figure 4: An example of a long radiology report describing a patient with focal cortical dysplasia (FCD). Dates were anonymized for presentation purposes.
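
As a reading aid for the confusion matrices in Figure 3, the sketch below computes per-class recall, using the standard definition of true positives divided by all samples of the true class. It assumes rows are true classes and columns are predicted classes, which may differ from the figure's layout. The counts are invented for illustration and are not the paper's results; `per_class_recall` is a hypothetical helper name.

```python
def per_class_recall(confusion, labels):
    """Compute recall per class; rows are true classes, columns are predictions."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])      # all samples whose true class is `label`
        true_positives = confusion[i][i]   # diagonal entry: correctly predicted
        recalls[label] = true_positives / row_total if row_total else float("nan")
    return recalls


# The four FCD classes named in the Figure 3 caption.
LABELS = ["No FCD", "Potential FCD", "Highly Probable FCD", "FCD"]

# Invented toy counts, NOT taken from the paper.
TOY_CONFUSION = [
    [8, 2, 0, 0],
    [1, 3, 1, 0],
    [0, 1, 2, 1],
    [0, 0, 1, 4],
]

recall = per_class_recall(TOY_CONFUSION, LABELS)

# The unweighted mean over classes gives every class equal weight,
# which matters when classes are imbalanced.
macro_recall = sum(recall.values()) / len(recall)
```

With imbalanced classes, per-class recall is more informative than overall accuracy, because a classifier can score high accuracy while missing most samples of a rare class.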