Confidence Calibration in Large Language Model-Based Entity Matching
Iris Kamsteeg, Juan Cardenas-Cartagena, Floris van Beers, Gineke ten Holt, Tsegaye Misikir Tashu, Matias Valdenegro-Toro
TL;DR
This work studies confidence calibration for RoBERTa-based Entity Matching (EM), a binary decision task that determines whether two records from different sources refer to the same entity. It benchmarks three calibration strategies (Temperature Scaling, Monte Carlo Dropout, and Ensembles) on four EM datasets: Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company. Calibration quality is evaluated via Expected Calibration Error ($ECE$) and predictive performance via $F_1$. The findings show that baseline RoBERTa predictions are slightly overconfident, with $ECE$ ranging from $0.0043$ to $0.0552$ across datasets, and that Temperature Scaling yields the most consistent $ECE$ reductions, up to $23.83\%$, without harming $F_1$. The results indicate that simple post-hoc calibration can improve the reliability of EM predictions in practical deployments, and they suggest combining methods and extending the study to larger LLMs as future work.
Abstract
This research explores the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study comparing baseline RoBERTa confidences on an Entity Matching task against confidences calibrated using Temperature Scaling, Monte Carlo Dropout, and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, which reduces Expected Calibration Error scores by up to 23.83%.
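To make the two central quantities concrete, the following is a minimal sketch of Expected Calibration Error and Temperature Scaling for a binary matcher. It is not the authors' implementation. It assumes the model emits a single match logit per record pair (equivalent to the difference of two softmax logits), uses ten equal-width confidence bins, and fits the temperature by a simple grid search over the negative log-likelihood. The function names, the grid, and the toy data are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions, where probs[i] is the predicted P(match).

    Confidence is the probability of the predicted class; predictions are
    grouped into equal-width confidence bins (an assumed binning scheme).
    """
    n = len(probs)
    # Per bin: [count, sum of confidences, number of correct predictions]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += 1.0 if pred == y else 0.0
    ece = 0.0
    for count, conf_sum, correct_sum in bins:
        if count:
            # Weighted gap between accuracy and mean confidence in the bin
            ece += (count / n) * abs(correct_sum / count - conf_sum / count)
    return ece


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of temperature-scaled logits."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid (assumed range 0.5 to 5.0) minimizing NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logits, labels, t))


# Toy, hand-made logits: very confident but only 75% accurate (overconfident).
logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

ece_before = expected_calibration_error([sigmoid(z) for z in logits], labels)
T = fit_temperature(logits, labels)
ece_after = expected_calibration_error([sigmoid(z / T) for z in logits], labels)
print(f"T={T:.2f}  ECE before={ece_before:.4f}  ECE after={ece_after:.4f}")
```

For brevity, the toy example fits and evaluates the temperature on the same pairs; in practice the temperature would be fitted on a held-out validation split and the ECE reported on test data. Because dividing a logit by a positive temperature never changes its sign, the predicted labels stay the same, which is consistent with Temperature Scaling leaving $F_1$ unaffected.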
