Does the Use of Unusual Combinations of Datasets Contribute to Greater Scientific Impact?

Yulin Yu, Daniel M. Romero

TL;DR

These findings underscore the potential of datasets combined unconventionally to elevate the impact of scientific discoveries and provide valuable insights for researchers, policymakers, and data curators.

Abstract

Scientific datasets play a crucial role in contemporary data-driven research, as they allow for the progress of science by facilitating the discovery of new patterns and phenomena. This mounting demand for empirical research raises important questions on how strategic data utilization in research projects can stimulate scientific advancement. In this study, we examine a hypothesis inspired by recombination theory, which suggests that innovative combinations of existing knowledge, including the use of unusual combinations of datasets, can lead to high-impact discoveries. Focusing on social science, we investigate the scientific outcomes of such atypical data combinations in more than 30,000 publications that leverage over 5,000 datasets curated within one of the largest social science databases, ICPSR. This study offers four important insights. First, combining datasets, particularly those infrequently paired, significantly contributes to both scientific and broader impacts (e.g., dissemination to the general public). Second, infrequently paired datasets maintain a strong association with citation even after controlling for the atypicality of dataset topics. In contrast, the atypicality of dataset topics has a much smaller positive impact on citation counts. Third, smaller and less experienced research teams tend to use atypical combinations of datasets in research more frequently than their larger and more experienced counterparts. Lastly, despite the benefits of data combination, papers that combine datasets remain infrequent. This finding suggests that the unconventional combination of datasets is an under-utilized but powerful strategy correlated with the scientific impact and broader dissemination of scientific discoveries.

Paper Structure (21 sections, 10 equations, 9 figures, 34 tables)

Figures (9)

  • Figure 1: The plot illustrates the effect size of the data combination variable on citations over 3, 5, and 10 years based on a Negative Binomial regression. The error bars indicate the 95% confidence intervals. This regression controls for dataset usage frequency, author attributes (i.e., the number of authors and their experience), journal impact factor, publication year, and subject areas. The results show a positive effect of data combination on citations over 3, 5, and 10 years. The inset of the figure shows the effect size and 95% confidence intervals for analyses conducted on publications published in four distinct periods: before 1990, 1990-2000, 2000-2010, and 2010-2020. The results reveal that the effect of data combination on citations is driven by papers published in more recent years, particularly those after 2000.
  • Figure 1: Distribution of the number of datasets used
  • Figure 2: Atypical combinations of datasets lead to higher citation rates and broader dissemination. (A) The plot illustrates the effect size of the atypicality in dataset combinations variable on citations over 3 years based on a Negative Binomial regression controlling for the various factors indicated in the panel headings. The leftmost panels display the effect size of atypicality in baseline regressions (without any control variables), while the rightmost panels display the effect size after collectively controlling for dataset attributes (use frequency and number of datasets), author attributes (number of authors and experience), journal impact factor, publication year, subjects, and paper novelty. The error bars indicate the 95% confidence intervals. (B) The effect size and 95% confidence intervals for the impact of atypicality in dataset combination on Twitter, Wikipedia, policy, and news mentions (outcome variables) based on a Negative Binomial regression. This regression incorporates all of the control variables listed above. (C) Illustration of quantifying atypicality of datasets using the Rao-Stirling index. In this illustration, we assume that a paper uses two datasets, namely $dataset_1$ and $dataset_2$. We first vectorize each dataset into a one-hot vector. Each coordinate in the vector corresponds to a paper in our dataset, and the coordinate takes a value of 1 if the respective dataset is used in that paper and a value of 0 otherwise. Subsequently, we calculate the distance, denoted as $D_{12}$, by computing the cosine similarity between $dataset_1$ and $dataset_2$. Using the number of datasets in a given paper, we calculate the parameters $P_1$ and $P_2$, each representing the share of one dataset among the datasets used in the paper. In this particular scenario, where two datasets are used in the paper, both $P_1$ and $P_2$ are equal to 1/2. We then use the equation shown in the figure to quantify the atypicality of the datasets used by the paper.
  • Figure 2: Distribution of 3-, 5-, and 10-year citation counts
  • Figure 3: A paper's atypical combination of datasets is more impactful than an atypical combination of the datasets' topics. (A) Illustration of quantifying topic atypicality: In this illustration, we consider a hypothetical paper that utilizes two datasets, namely $dataset_1$ and $dataset_2$. The first dataset, $dataset_1$, is associated with two topic tags: $Topic_1$ and $Topic_2$, while the second dataset, $dataset_2$, is associated with two topic tags: $Topic_2$ and $Topic_3$. We combine all the topics from both datasets, resulting in a topic set containing $Topic_1$, $Topic_2$, and $Topic_3$. Subsequently, we represent each of these topics as a one-hot vector. In this representation, each coordinate in the vector corresponds to a paper, and the coordinate takes a value of 1 if the respective topic is present in that paper; otherwise, it takes a value of 0. Using cosine similarity, we calculate the distance between these topic vectors, and we apply the same quantification method to all pairs of topics. This process allows us to determine the topic atypicality within the paper. (B) The effect size of the topic atypicality and atypicality of data combinations variables on citations over 3, 5, and 10 years based on a Negative Binomial regression. The error bars indicate the 95% confidence intervals. Our model includes two main independent variables: topic atypicality and atypicality of data combinations. The dependent variables are the three-, five-, and ten-year citation counts, and the model incorporates all the control variables described in the preceding section (full control setting). (C) The effect size and 95% confidence intervals are presented separately for analyses conducted on publications published in four distinct periods: before 1990, 1990-2000, 2000-2010, and 2010-2020. Our results reveal that the effects are driven mostly by the more recent years, particularly those after 2000.
  • ...and 4 more figures
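
The atypicality calculation described in the Figure 2 (C) caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the equation itself appears only in the figure, so the sketch assumes the common Rao-Stirling form (the sum of $P_i P_j D_{ij}$ over unordered dataset pairs) and assumes the distance is $D_{ij} = 1 - \text{cosine similarity}$. The function names (`usage_vectors`, `rao_stirling`) and the toy corpus are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def usage_vectors(paper_datasets):
    """One binary vector per dataset: coordinate k is 1 if paper k uses that dataset."""
    all_datasets = sorted({d for used in paper_datasets for d in used})
    return {
        d: [1 if d in used else 0 for used in paper_datasets]
        for d in all_datasets
    }


def rao_stirling(datasets, vectors):
    """Assumed form: sum of P_i * P_j * D_ij over unordered pairs of datasets.

    P_i is 1 / (number of distinct datasets in the paper), as in the caption.
    D_ij = 1 - cosine similarity is an assumption of this sketch.
    """
    unique = sorted(set(datasets))
    if len(unique) < 2:
        return 0.0  # a single dataset has no pair to be atypical with
    p = 1.0 / len(unique)
    return sum(
        p * p * (1.0 - cosine_similarity(vectors[a], vectors[b]))
        for a, b in combinations(unique, 2)
    )


# Hypothetical toy corpus: each entry is the set of datasets used by one paper.
papers = [
    {"A", "B"},
    {"A", "B"},
    {"A", "C"},
    {"C"},
]
vectors = usage_vectors(papers)
score_common = rao_stirling({"A", "B"}, vectors)  # A and B are often used together
score_rare = rao_stirling({"A", "C"}, vectors)    # A and C are rarely used together
```

In this toy corpus the rarely paired datasets receive the higher score, which matches the intuition behind the measure. Under the same assumptions, the topic atypicality of Figure 3 (A) could be computed by passing topic tags and topic usage vectors to the same function instead of datasets.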