Data inaccuracy quantification and uncertainty propagation for bibliometric indicators

Paul Donner

TL;DR

This paper addresses uncertainty in bibliometric indicators that arises from data errors, combining empirical error distributions with Bayesian regression and Monte Carlo simulation. It shows how to propagate uncertainty from base quantities, such as citation counts and document-type assignments, through a measurement model to obtain probability distributions for indicator values. Using synthetic simulations and a real-world application to chemistry research groups, the author shows that common metrics (e.g., MNCS) can carry substantial uncertainty that point estimates hide. The work advocates explicit uncertainty reporting in bibliometrics, arguing that data quality issues can meaningfully affect interpretation and decision-making, in line with the Leiden Manifesto.

Abstract

This study introduces an approach to estimate the uncertainty in bibliometric indicator values that is caused by data errors. This approach utilizes Bayesian regression models, estimated from empirical data samples, which are used to predict error-free data. Through direct Monte Carlo simulation - drawing many replicates of predicted data from the estimated regression models for the same input data - probability distributions for indicator values can be obtained, which provide the information on their uncertainty due to data errors. It is demonstrated how uncertainty in base quantities, such as the number of publications of a unit of certain document types and the number of citations of a publication, can be propagated along a measurement model into final indicator values. Synthetic examples are used to illustrate the method and real bibliometric research evaluation data is used to show its application in practice. Though in this contribution we just use two out of a larger number of known bibliometric error categories and therefore can account for only some part of the total uncertainty due to inaccuracies, the latter example reveals that average values of citation impact scores of publications of research groups need to be used very cautiously as they often have large margins of error resulting from data inaccuracies.
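
The Monte Carlo step described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: in place of a Bayesian regression estimated from empirical samples, it assumes a hypothetical error model in which the number of missing citations per publication is Poisson-distributed with a rate proportional to the observed count. The function names, the `miss_rate` parameter, and the input numbers are all made up for illustration, and the field-expected citation rates are held fixed for simplicity.

```python
import math
import random
import statistics


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's algorithm (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_mncs(observed, expected, miss_rate=0.05, runs=1000, seed=0):
    """Return `runs` simulated MNCS values for one unit.

    Assumption (not from the paper): missing citations per publication
    follow Poisson(miss_rate * (observed + 1)); `expected` stays fixed.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(runs):
        scores = []
        for c, e in zip(observed, expected):
            corrected = c + sample_poisson(rng, miss_rate * (c + 1))
            scores.append(corrected / e)  # normalized citation score
        draws.append(statistics.fmean(scores))  # MNCS of this replicate
    return draws


def summarize(draws, level=0.95):
    """Return (lower bound, median, upper bound) of a central interval."""
    ordered = sorted(draws)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], statistics.median(ordered), ordered[hi_idx]


# Made-up input: observed citation counts and field-expected rates.
observed = [0, 3, 12, 45, 7]
expected = [5.2, 5.2, 8.1, 8.1, 6.4]

point = statistics.fmean(c / e for c, e in zip(observed, expected))
draws = simulate_mncs(observed, expected)
low, median, high = summarize(draws)
print(f"point estimate {point:.3f}")
print(f"simulated median {median:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Each replicate recomputes the indicator from one simulated "error-free" data set, so the spread of the replicates is the uncertainty that a single point estimate would hide. Because this toy error model only ever adds citations, every simulated value lies at or above the point estimate; a model fitted to real error samples need not behave this way.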

Paper Structure

This paper contains 17 sections, 5 figures, and 11 tables.

Introduction
Related work
Uncertainty in scientometrics not related to data errors
Accuracy and completeness of citation links in bibliographic databases
Missing links: Empirical study of citation error distribution
Methods and data
Results
Propagation of uncertainty from data to bibliometric indicators -- a Bayesian regression approach
Incorporating information about uncertainty into statistical models
A simulation exercise of bibliometric uncertainty propagation -- Simulating error-free data
A real-world example application
Discussion
Summary
Limitations and future work
Conclusion
...and 2 more sections

Figures (5)

Figure 1: Scatterplot of WoS citation count and number of missing citations found
Figure 2: Estimates of normalized citations scores for a sample of publications, 1000 runs, symbols: $\times$ original error-affected data, $\bullet$ simulated error-free data
Figure 3: Estimated error-corrected data for publications (P), citations (C), and mean normalized citation score (MNCS) for 110 chemistry research groups, medians and 95% credible intervals, 1000 runs, symbols: $\times$ original error-affected data, $\bullet$ simulated error-free data
Figure 4: Relationship between P and uncertainty in MNCS, 110 chemistry research groups
Figure 5: Results of simulation exercise A1: data affected by citation count error
