Use of diverse data sources to control which topics emerge in a science map

Juan Pablo Bascur, Rodrigo Costas, Suzan Verberne

TL;DR

This work tackles the biased topic representation inherent in traditional science maps by introducing diverse external data sources to control topic emergence. It builds six bipartite external-source networks (AUTHORS, FACEBOOK, TWUSER, TWCONV, PATENT, POLICY) and compares them to text-similarity (Sentence-BERT) and citation baselines using MeSH-based topic categories across a massive corpus. A refined evaluation framework based on Purity profiles and topic-category analyses shows that external sources can preferentially illuminate different topic areas (e.g., health with Facebook, biotechnology with patents, geography with authors), and that Twitter conversations often yield strong, distinctive signals when combined with text similarity. These findings enable tailored science maps for specific needs, while highlighting the value and limitations of integrating heterogeneous data sources to capture diverse organizational structures of scientific knowledge. The approach offers practical avenues for targeted mapping, policy analysis, and understanding societal perceptions of science. Future work includes optimal data-source mixing and broader accessibility of social data.

Abstract

Traditional science maps visualize topics by clustering documents, but they are inherently biased toward clustering certain topics over others. If these topics could be chosen, then the science maps could be tailored for different needs. In this paper, we explore the use of document networks from diverse data sources as a tool to control the topic clustering bias of a science map. We analyze this by evaluating the clustering effectiveness of several topic categories over two traditional and six non-traditional data sources. We found that the topics favored in each non-traditional data source are: health for Facebook users, biotechnology for patent families, government and social issues for policy documents, food for Twitter conversations, nursing for Twitter users, and geographical entities for document authors (the favoring in this last source was particularly strong). Our results show that diverse data sources can be used to control topic bias, which opens up the possibility of creating science maps tailored for different needs.

Paper Structure

This paper contains 31 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Example of a Purity profile. This is a line plot of the Purity profile of the topic Bacillus thuringiensis [B03.510.460.410.158.218.800] for the Policy documents BERT network calculated using Coverage 0.50. This topic has 60 topic documents among the core documents used by the Policy networks, which for this Coverage value means that the Purity is calculated after selecting clusters that contain at least 30 topic documents. So, for example, if we assume that the selected clusters contain exactly 30 topic documents, from the figure we can say that at different Resolution values the network can place 30 out of the 60 topic documents in one cluster containing 150 documents ($30/0.2$), two clusters containing 75 documents ($30/0.4$), and four clusters containing 50 documents ($30/0.6$). Using lower Coverage values or topics with more topic documents tends to achieve higher Purity at the highest NSC value.
  • Figure 2: Diagram on the representation of results. A: How to calculate from topic Purity profiles if a topic has higher clustering effectiveness than BERT in the Pure or the Mixed network. In this example, a topic has higher Purity than BERT for the Mixed network, but not so for the Pure network. B: How to calculate from topic category Purity profiles the number of NSC that a topic category is in the top third Purity of a network. In this example, the topic categories A, B and C achieve a top third count of 0.7, 0.3 and 0, respectively.
  • Figure 3: Examples of Purity of several topic categories for different networks. All profiles are for Size bin 161-320 and Coverage 0.50. To interpret these plots, it is important to keep in mind that each profile represents the average Purity and NSC across all topics in the topic category and Size bin, based on multiple clustering solutions. One way to interpret each curve is as if it were the Purity profile of a single, imaginary topic that combines all the topics in the category, including both the high- and low-performing ones. This topic would contain 240 documents (the average size of the bin), with each NSC value in the curve including 120 topic documents (due to Coverage 0.50). Purity values should not be compared across different sources, as some networks are substantially smaller, reducing clustering quality due to lack of information and making such comparisons unfair.
  • Figure 4: Examples of Purity profiles for individual topics across different networks. All Purity profiles are calculated for Coverage 0.50. The title of each plot indicates the external source, topic category, topic name and topic size.
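
The arithmetic in the captions of Figures 1 and 3 can be reproduced with a minimal sketch. It assumes that Purity is the share of topic documents among all documents in the selected clusters, which is what the caption's divisions ($30/0.2$, $30/0.4$, $30/0.6$) imply. The function names and the rounding up of the Coverage threshold are illustrative assumptions, not the authors' implementation.

```python
import math


def required_topic_docs(topic_size: int, coverage: float) -> int:
    """Minimum number of topic documents the selected clusters must hold.

    Assumption: the threshold is topic_size * coverage, rounded up.
    """
    return math.ceil(topic_size * coverage)


def implied_total_docs(topic_docs: int, purity: float) -> int:
    """Total documents in the selected clusters implied by a Purity value.

    Assumption: purity = topic_docs / total_docs, so total = topic_docs / purity.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    return round(topic_docs / purity)


# Figure 1: a topic with 60 topic documents at Coverage 0.50.
needed = required_topic_docs(60, 0.50)

# Purity values read from the caption's example for 1, 2 and 4 selected clusters.
for n_clusters, purity in [(1, 0.2), (2, 0.4), (4, 0.6)]:
    total = implied_total_docs(needed, purity)
    print(f"{n_clusters} cluster(s): {needed} topic docs among {total} documents")

# Figure 3: the imaginary combined topic of 240 documents at Coverage 0.50.
print(required_topic_docs(240, 0.50))
```

Under these assumptions, a higher Purity at a larger number of selected clusters means the same topic documents are recovered with fewer unrelated documents around them.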