Data inaccuracy quantification and uncertainty propagation for bibliometric indicators
Paul Donner
TL;DR
This paper tackles the problem of uncertainty in bibliometric indicators arising from data errors by combining empirical error distributions with Bayesian regression and Monte Carlo simulation. It shows how to propagate uncertainty from base quantities, such as citation counts and document-type assignments, through measurement models to obtain probabilistic indicator values. Through synthetic simulations and a real-world application to chemistry research groups, the study reveals substantial uncertainty in common metrics (e.g., MNCS) that point estimates would hide. The work advocates explicit uncertainty reporting in bibliometrics, arguing that data quality issues can meaningfully affect interpretation and decision-making, in line with the Leiden Manifesto.
Abstract
This study introduces an approach to estimate the uncertainty in bibliometric indicator values that is caused by data errors. The approach uses Bayesian regression models, estimated from empirical data samples, to predict error-free data. Through direct Monte Carlo simulation, that is, by drawing many replicates of predicted data from the estimated regression models for the same input data, probability distributions for indicator values are obtained, and these distributions quantify the uncertainty due to data errors. It is demonstrated how uncertainty in base quantities, such as the number of publications of a unit of certain document types and the number of citations of a publication, can be propagated along a measurement model into final indicator values. Synthetic examples are used to illustrate the method, and real bibliometric research evaluation data is used to show its application in practice. This contribution considers only two of the many known bibliometric error categories and can therefore account for only part of the total uncertainty due to inaccuracies. Even so, the real-data example reveals that average citation impact scores of the publications of research groups need to be used very cautiously, as they often have large margins of error resulting from data inaccuracies.
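The propagation idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the Bayesian regression is replaced here by a deliberately simple, hypothetical error model in which every recorded citation has a fixed chance (`miss_rate`) of standing in for one additional citation that the database missed. The function names, the example citation counts, the expected values, and the 5% rate are all assumptions made for illustration. The measurement model is a mean normalized citation score (MNCS), computed as the mean of citations divided by field-expected citations.

```python
import random
import statistics


def simulate_mncs(observed_citations, expected_citations,
                  miss_rate=0.05, n_replicates=2000, seed=42):
    """Return one simulated MNCS value per Monte Carlo replicate.

    Assumed (hypothetical) error model: for each recorded citation there
    is a probability `miss_rate` that one further citation was missed,
    so the number of missed citations is Binomial(observed, miss_rate).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_replicates):
        ratios = []
        for observed, expected in zip(observed_citations, expected_citations):
            # Draw a plausible "error-free" citation count for this paper.
            missed = sum(rng.random() < miss_rate for _ in range(observed))
            # Measurement model: normalize by the field-expected value.
            ratios.append((observed + missed) / expected)
        # Indicator value for this replicate: mean of normalized scores.
        draws.append(statistics.fmean(ratios))
    return draws


def interval(draws, lower=0.025, upper=0.975):
    """Return an empirical interval from the sorted replicate values."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Made-up input data for a small research group.
observed = [12, 3, 0, 45, 7]
expected = [8.0, 8.0, 5.5, 20.0, 5.5]

draws = simulate_mncs(observed, expected)
point = statistics.fmean(c / e for c, e in zip(observed, expected))
low, high = interval(draws)
print(f"point estimate: {point:.3f}")
print(f"95% interval from simulation: [{low:.3f}, {high:.3f}]")
```

Because this toy error model only ever adds citations, every simulated value lies at or above the point estimate. A model fitted to empirical error samples, as in the study, would not generally have that one-sided shape.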
