Use of diverse data sources to control which topics emerge in a science map
Juan Pablo Bascur, Rodrigo Costas, Suzan Verberne
TL;DR
This work addresses the topic bias inherent in traditional science maps by using diverse external data sources to control which topics emerge. It builds six bipartite external-source networks (AUTHORS, FACEBOOK, TWUSER, TWCONV, PATENT, POLICY) and compares them to text-similarity (Sentence-BERT) and citation baselines, using MeSH-based topic categories over a large corpus. A refined evaluation framework based on Purity profiles and topic-category analyses shows that external sources preferentially illuminate different topic areas (e.g., health with Facebook, biotechnology with patents, geography with authors), and that Twitter conversations often yield strong, distinctive signals when combined with text similarity. These findings enable science maps tailored to specific needs, while highlighting both the value and the limitations of integrating heterogeneous data sources to capture diverse organizational structures of scientific knowledge. The approach has practical applications in targeted mapping, policy analysis, and understanding societal perceptions of science; future work includes optimal data-source mixing and broader accessibility of social data.
Abstract
Traditional science maps visualize topics by clustering documents, but they are inherently biased toward clustering certain topics over others. If these topics could be chosen, then science maps could be tailored for different needs. In this paper, we explore the use of document networks from diverse data sources as a tool to control the topic clustering bias of a science map. We analyze this by evaluating the clustering effectiveness of several topic categories over two traditional and six non-traditional data sources. We found that each non-traditional data source favors different topics: health for Facebook users, biotechnology for patent families, government and social issues for policy documents, food for Twitter conversations, nursing for Twitter users, and geographical entities for document authors (the favoring in this last source was particularly strong). Our results show that diverse data sources can be used to control topic bias, which opens up the possibility of creating science maps tailored for different needs.
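To make the idea of a document network derived from an external data source more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that two documents are linked once for every external entity they share (for example, a Twitter user who mentioned both), and it scores a cluster with a plain purity measure: the fraction of the cluster's documents that carry a given topic. The function names, the toy data, and this particular purity definition are illustrative assumptions; the paper's Purity profiles may be defined differently.

```python
from collections import defaultdict
from itertools import combinations


def project_bipartite(doc_to_entities):
    """Project a document-entity bipartite network onto documents.

    Assumption: the weight of a document pair is the number of
    external entities (users, patents, policy documents, ...) that
    the two documents share.
    """
    entity_to_docs = defaultdict(set)
    for doc, entities in doc_to_entities.items():
        for entity in entities:
            entity_to_docs[entity].add(doc)

    weights = defaultdict(int)
    for docs in entity_to_docs.values():
        # Every pair of documents attached to the same entity gets one link.
        for a, b in combinations(sorted(docs), 2):
            weights[(a, b)] += 1
    return dict(weights)


def topic_purity(cluster_docs, docs_with_topic):
    """Fraction of a cluster's documents that carry the topic (assumed definition)."""
    cluster_docs = set(cluster_docs)
    if not cluster_docs:
        return 0.0
    return len(cluster_docs & set(docs_with_topic)) / len(cluster_docs)


# Hypothetical toy data: documents and the Twitter users who mentioned them.
doc_to_users = {
    "d1": {"u1", "u2"},
    "d2": {"u1", "u2"},
    "d3": {"u2"},
    "d4": {"u3"},
}
edges = project_bipartite(doc_to_users)

# Hypothetical cluster and topic assignment.
purity = topic_purity({"d1", "d2", "d3"}, {"d1", "d2"})
```

In this toy example, `d4` shares no user with any other document and therefore has no edges, so a clustering of this network could not group it with the rest. Swapping in a different external source changes which documents end up connected, which is the mechanism by which the choice of data source can favor some topics over others.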
