Confidence Calibration in Large Language Model-Based Entity Matching

Iris Kamsteeg, Juan Cardenas-Cartagena, Floris van Beers, Gineke ten Holt, Tsegaye Misikir Tashu, Matias Valdenegro-Toro

TL;DR

This work studies confidence calibration for RoBERTa-based Entity Matching (EM), a binary classification task that decides whether two records from different sources refer to the same entity. It benchmarks three calibration strategies (Temperature Scaling, Monte Carlo Dropout, and Ensembles) across six EM datasets, evaluating calibration quality via $ECE$ and predictive performance via $F_1$. Baseline RoBERTa predictions are slightly overconfident, with $ECE$ ranging from $0.0043$ to $0.0552$. Temperature Scaling yields the most consistent $ECE$ reductions, up to $23.83\%$, without harming $F_1$. The results indicate that simple post-hoc calibration can improve the reliability of EM predictions in practical deployments, and they suggest combining methods and extending the study to larger LLMs as future work.
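To make the $ECE$ metric concrete, here is a minimal sketch of the standard equal-width-bin formulation: predictions are grouped by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the choice of 10 bins, and the toy inputs are assumptions for illustration; the paper's exact binning setup is not stated in this summary.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    n_bins=10 is an illustrative default, not a value taken from the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1.
    bin_ids = np.minimum(np.digitize(confidences, edges[1:-1]), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return float(ece)

# Toy example: four predictions at 0.95 confidence, three of them correct.
toy_ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
```

In the toy example the model claims 95% confidence but is right 75% of the time, so the single occupied bin contributes a gap of 0.20. An overconfident model of the kind described above shows this pattern of confidence exceeding accuracy.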

Abstract

This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.
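Temperature Scaling, as commonly described, divides the classifier's logits by a single scalar $T$ fitted on held-out data, which softens ($T > 1$) or sharpens ($T < 1$) the probabilities without changing the predicted class. The sketch below fits $T$ by minimising negative log-likelihood over a grid. The grid search, the function names, and the toy logits are assumptions for illustration; the optimiser used in the paper is not given in this summary.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits, temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature with the lowest validation NLL from a grid.

    The log-spaced grid from 0.1 to 10 is a hypothetical choice.
    """
    if grid is None:
        grid = np.logspace(-1, 1, 201)
    labels = np.asarray(val_labels, dtype=int)
    losses = [nll(val_logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Toy validation set: confident two-class logits, one of four predictions wrong.
val_logits = np.array([[5.0, -5.0], [5.0, -5.0], [5.0, -5.0], [-5.0, 5.0]])
val_labels = np.array([0, 0, 0, 0])
T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits, T)
```

Because every logit is divided by the same positive scalar, the arg-max class is unchanged, which is consistent with the reported result that Temperature Scaling does not harm $F_1$.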

Paper Structure

This paper contains 25 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of this research's model (without any confidence calibration methods visualised), model input and model output. In addition to classifying each entry pair as a 'match' or 'no match', the model also generates a score that should reflect the model's confidence in its prediction.
  • Figure 2: The mean confidence histograms over five runs for the Abt-Buy and Company datasets, using the baseline RoBERTa model predictions, on a logarithmic scale. The distribution of correct predictions is shown in green; the distribution of incorrect predictions is shown in red. The y-axis presents percentages of occurrences rather than absolute numbers of occurrences. Error bars denote standard deviations. ECE, MCE, and RMSCE values are reported to four decimal places. The same confidence histograms for the other four datasets are presented in the appendix.
  • Figure 3: The mean confidence histograms over five runs for all datasets, using the baseline RoBERTa model predicted probabilities. The y-axis presents percentages of occurrences rather than absolute numbers of occurrences. Error bars denote standard deviations. ECE, MCE, and RMSCE values are reported to four decimal places.
  • Figure 4: The mean confidence histograms over five runs for the DBLP-ACM-Structured, DBLP-ACM-Dirty, iTunes-Amazon-Structured and iTunes-Amazon-Dirty datasets, using the baseline RoBERTa model predictions, on a logarithmic scale. The distribution of correct predictions is shown in green; the distribution of incorrect predictions is shown in red. The y-axis presents percentages of occurrences rather than absolute numbers of occurrences. Error bars denote standard deviations. ECE, MCE, and RMSCE values are reported to four decimal places.
  • Figure 5: The reliability diagrams using data from five runs for all datasets, using the baseline RoBERTa model predictions. ECE, MCE, and RMSCE values are reported to four decimal places. For some datasets, certain predicted-probability bins are empty because no predictions fell within them. The plotted diagonal represents perfect calibration.
  • ...and 3 more figures