Data inaccuracy quantification and uncertainty propagation for bibliometric indicators

Paul Donner

TL;DR

This paper tackles the problem of uncertainty in bibliometric indicators arising from data errors by combining empirical error distributions with Bayesian regression and Monte Carlo simulation. It shows how to propagate uncertainty from base quantities, such as citation counts and document-type assignments, through measurement models to obtain probabilistic indicator values. Through synthetic simulations and a real-world application to chemistry research groups, the author reveals substantial uncertainty in common metrics, such as the mean normalized citation score (MNCS), that point estimates would hide. The work advocates explicit uncertainty reporting in bibliometrics, arguing that data quality issues can meaningfully affect interpretation and decision-making, in line with the Leiden Manifesto.

Abstract

This study introduces an approach for estimating the uncertainty in bibliometric indicator values that is caused by data errors. The approach uses Bayesian regression models, estimated from empirical data samples, to predict error-free data. Direct Monte Carlo simulation, that is, drawing many replicates of predicted data from the estimated regression models for the same input data, yields probability distributions for indicator values, which quantify their uncertainty due to data errors. It is demonstrated how uncertainty in base quantities, such as a unit's number of publications of certain document types and the number of citations of a publication, can be propagated along a measurement model into final indicator values. Synthetic examples illustrate the method, and real bibliometric research evaluation data show its application in practice. This contribution uses only two of a larger number of known bibliometric error categories and can therefore account for only part of the total uncertainty due to inaccuracies. Even so, the real-data example reveals that average citation impact scores of research groups' publications need to be used very cautiously, as they often have large margins of error resulting from data inaccuracies.
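The propagation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's model: the Gamma-distributed rate below is a hypothetical stand-in for one posterior draw from a fitted Bayesian regression, the Poisson model for missing citations is an assumption chosen for simplicity, and the expected citation values used for normalization are treated as fixed. The function names and all numbers are invented for illustration.

```python
import math
import random
import statistics


def draw_poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_mncs(observed_citations, expected_citations, n_runs=1000, seed=0):
    """Return n_runs simulated MNCS values for one unit.

    Each run draws an error-corrected citation count per publication and
    recomputes the indicator, so data uncertainty propagates into the MNCS.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_runs):
        # ASSUMPTION: stand-in for a posterior parameter draw; mean is
        # 20 * 0.005 = 0.1 missing citations per (observed citation + 1).
        rate = rng.gammavariate(20.0, 0.005)
        scores = []
        for c, e in zip(observed_citations, expected_citations):
            # ASSUMPTION: missing citations are Poisson with mean rate * (c + 1).
            missing = draw_poisson(rng, rate * (c + 1))
            scores.append((c + missing) / e)  # normalized citation score
        draws.append(statistics.fmean(scores))  # MNCS for this replicate
    return draws


def summarize(draws):
    """Median and an approximate 95% interval from empirical quantiles."""
    ordered = sorted(draws)
    n = len(ordered)
    lower = ordered[int(0.025 * n)]
    upper = ordered[min(n - 1, int(0.975 * n))]
    return statistics.median(ordered), lower, upper


# Invented example data: four publications of one hypothetical research group.
observed = [0, 3, 12, 40]
expected = [4.0, 4.0, 8.0, 10.0]
median, lower, upper = summarize(simulate_mncs(observed, expected))
print(f"MNCS median {median:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Because this sketch only ever adds missing citations, every simulated MNCS is at least as large as the point estimate from the observed data. A model that also covers other error categories, such as document-type misassignment, could shift values in either direction.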

Paper Structure (17 sections, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Scatterplot of WoS citation count and number of missing citations found
  • Figure 2: Estimates of normalized citation scores for a sample of publications, 1000 runs, symbols: $\times$ original error-affected data, $\bullet$ simulated error-free data
  • Figure 3: Estimated error-corrected data for publications (P), citations (C), and mean normalized citation score (MNCS) for 110 chemistry research groups, medians and 95% credible intervals, 1000 runs, symbols: $\times$ original error-affected data, $\bullet$ simulated error-free data
  • Figure 4: Relationship between P and uncertainty in MNCS, 110 chemistry research groups
  • Figure 5: Results of simulation exercise A1: data affected by citation count error