Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Aman Sinha, Timothee Mickus, Marianne Clausel, Mathieu Constant, Xavier Coubez

TL;DR

The paper investigates whether domain-specific pretraining and uncertainty-aware modeling jointly improve biomedical text classification. By benchmarking frequentist and Bayesian DNNs across six English and French biomedical datasets, and evaluating both classification and uncertainty metrics, the study shows that domain-specific models generally boost accuracy, while uncertainty-aware designs improve calibration; the combination often yields favorable entropy behavior in output distributions. However, the exact task strongly modulates these effects, meaning there is no one-size-fits-all solution. The framework highlights the practical value of considering predictive entropy and calibration in clinical settings, guiding practitioners to tailor model choices to specific diagnostic or decision-making tasks. Overall, domain specificity and uncertainty-awareness are compatible and beneficial under the right task conditions, with entropy-based diagnostics providing insight into prediction reliability.

Abstract

The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission-critical settings such as biomedical applications, other aspects also factor in, chief of which is a model's ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model's output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly.
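
To make the quantity under discussion concrete, here is a minimal sketch of the Shannon entropy of a classifier's output distribution. It is an illustration only, not the authors' code: the function name `predictive_entropy` and the example probability vectors are assumptions chosen for this sketch.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical output distribution.

    `probs` is assumed to be a sequence of class probabilities summing to 1.
    Zero-probability classes contribute nothing (0 * log 0 is taken as 0).
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical output distributions over three classes (made up for illustration).
confident = [0.98, 0.01, 0.01]   # peaked distribution: low entropy
uncertain = [1 / 3, 1 / 3, 1 / 3]  # uniform distribution: maximal entropy, log(3)

if __name__ == "__main__":
    print(predictive_entropy(confident))
    print(predictive_entropy(uncertain))
```

A peaked distribution yields entropy near zero, while a uniform one reaches the maximum of log K for K classes, which is why entropy can serve as a rough indicator of how decisive a prediction is.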

Paper Structure (26 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Illustration of this study's setup. We perform a systematic comparison of domain-specificity and uncertainty-awareness in the medical domain.
  • Figure 2: Performances for empirically best models (selected metrics), $z$-normalized per dataset. See \ref{tab:final-all-seeds} in \ref{adx:sup results} for full non-normalized results.
  • Figure 3: SHAP attributions. Variables are ordered by mean absolute SHAP value. Blue: weight assigned when the variable is negative; red: weight assigned when it is positive. 'ds.' denotes a categorical variable tracking the dataset.
  • Figure 4: Entropy vs. probability mass assigned to the target ($z$-normalized per classifier). Orange: correct predictions; Blue: incorrect.
  • Figure 5: Comparison of BNN models across datasets on the classification task, measured by macro-F1 on the validation set.
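
The captions for Figures 2 and 4 refer to $z$-normalization within a group (per dataset or per classifier). A minimal sketch of that operation follows. The function name `z_normalize` is hypothetical, and the use of the population standard deviation is an assumption, since the captions do not say which variant the authors used.

```python
import statistics


def z_normalize(scores):
    """Rescale a group of scores to zero mean and unit standard deviation.

    Assumption: population standard deviation (statistics.pstdev).
    A constant group has no spread, so it is mapped to all zeros.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


if __name__ == "__main__":
    # Made-up scores for one group; these are not results from the paper.
    print(z_normalize([1.0, 2.0, 3.0]))
```

Normalizing within each group puts scores from tasks of different difficulty on a common scale, so relative rankings can be compared across datasets.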