To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise

Nouran Khallaf, Serge Sharoff

TL;DR

Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios, and Monte Carlo dropout approaches demonstrate consistently strong performance across all languages.

Abstract

This study examines the role of uncertainty estimation (UE) methods in multilingual text classification under noisy and non-topical conditions. Using a complex-vs-simple sentence classification task across several languages, we evaluate a range of UE techniques on several metrics to assess how much they contribute to more robust predictions. Results indicate that while methods relying on softmax outputs remain competitive in high-resource in-domain settings, their reliability declines in low-resource or domain-shift scenarios. In contrast, Monte Carlo dropout approaches demonstrate consistently strong performance across all languages, offering more robust calibration, stable decision thresholds, and greater discriminative power even under adverse conditions. We further demonstrate the positive impact of UE on non-topical classification: abstaining from predicting the 10% most uncertain instances increases the macro F1 score from 0.81 to 0.85 in the Readme task. By integrating UE with trustworthiness metrics, this study provides actionable insights for developing more reliable NLP systems in real-world multilingual environments. Code is available at https://github.com/Nouran-Khallaf/To-Predict-or-Not-to-Predict

Paper Structure (41 sections, 2 figures, 11 tables)

Figures (2)

  • Figure 1: Cross-language average $z$-scores after applying a direction-aware benefit transform. Cell colors indicate relative performance (warmer is better), and horizontal whiskers show the standard deviation of $z$ across languages.
  • Figure 2: Improvement measured via $\Delta$F1 over the baseline from Table \ref{tab:full_metrics_with_accuracy} across rejection thresholds (columns for 1%, 5%, 10%, and 15%) per language per UE score. A solid black box marks the best UE score per column and a dash–dot box marks the second best. The UE scores are labelled on the right.
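
The $\Delta$F1 quantity in Figure 2 (and the abstention result quoted in the abstract) can be illustrated with a minimal sketch: rank instances by an uncertainty score, drop the most uncertain fraction, and compare macro F1 on the retained instances with macro F1 on the full set. This is not the authors' implementation; the function names `macro_f1` and `delta_f1_after_rejection`, the truncating rule for the number of rejected instances, and the tie-breaking by input order are all assumptions made for illustration.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one instance")
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def delta_f1_after_rejection(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    uncertainty: Sequence[float],
    reject_frac: float,
) -> float:
    """Macro F1 on the retained instances minus macro F1 on all instances.

    Assumption (hypothetical choice): the number of rejected instances is
    int(reject_frac * n), and ties in uncertainty are broken by input order.
    """
    n = len(y_true)
    n_reject = int(reject_frac * n)
    if n_reject >= n:
        raise ValueError("reject_frac leaves no instances to evaluate")
    # Sort indices from least to most uncertain and keep the confident prefix.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    kept = order[: n - n_reject]
    baseline = macro_f1(y_true, y_pred)
    retained = macro_f1([y_true[i] for i in kept], [y_pred[i] for i in kept])
    return retained - baseline


# Toy data (invented for illustration): the two errors carry the highest uncertainty.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
unc = [0.1, 0.1, 0.2, 0.1, 0.9, 0.1, 0.2, 0.1, 0.1, 0.8]
gain = delta_f1_after_rejection(y_true, y_pred, unc, reject_frac=0.2)
```

A positive value means the uncertainty score ranks errors above correct predictions, so abstaining on the flagged instances helps; a score that is uninformative about errors would give a gain near zero.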