From Coverage to Prestige: A Comprehensive Assessment of Large-Scale Scientometric Data
Guoyang Rong, Ying Chen, Thorsten Koch, Keisuke Honda
TL;DR
This study systematically compares Web of Science (WoS) and Crossref as large-scale scientometric data sources and evaluates how merging them affects data completeness and quality. Using Reference Coverage Rate (RCR) and Article Scientific Prestige (ASP), the authors show that WoS offers better coverage of high-impact literature, while Crossref broadens literature inclusion. Merging the two datasets significantly improves citation completeness, particularly in smaller disciplines, yet introduces low-impact citations that polarize overall data quality. The effects are discipline-dependent: Science, Biology, and Medicine benefit most from merging, whereas Arts and Social Sciences are more vulnerable to quality dilution. The work provides a practical framework for dataset assessment and underscores the trade-offs of data integration in scientometric research.
Abstract
As scientometric research deepens, the impact of data quality on research outcomes has garnered increasing attention. This study, based on the Web of Science (WoS) and Crossref datasets, systematically evaluates the differences between data sources and the effects of data merging through matching, comparison, and integration. Two core metrics are employed: Reference Coverage Rate (RCR), which measures citation completeness (quantity), and Article Scientific Prestige (ASP), which measures academic influence (quality). The results indicate that the WoS dataset outperforms Crossref in both its coverage of high-impact literature and its ASP scores, while the Crossref dataset provides complementary value through its broader coverage of literature. Data merging significantly improves the completeness of the citation network, with particularly pronounced benefits in smaller disciplinary clusters such as Education and Arts. However, merging also introduces some low-quality citations, resulting in a polarization of overall data quality. Moreover, the impact of merging varies across disciplines: high-impact clusters such as Science, Biology, and Medicine benefit the most, whereas clusters such as Social Sciences and Arts are more vulnerable to negative effects. This study highlights the critical role of data sources in scientometric research and provides a framework for assessing and improving data quality.
