Representing the Disciplinary Structure of Physics: A Comparative Evaluation of Graph and Text Embedding Methods
Isabel Constantino, Sadamori Kojaku, Santo Fortunato, Yong-Yeol Ahn
TL;DR
This work compares graph-based and text-based embeddings on their ability to recover the hierarchical PACS structure in a large corpus of physics papers published by the APS. Using KNN classification, hierarchy-aware radius-of-gyration analyses, pairwise distance comparisons, and link-prediction tasks, the authors find that graph embeddings (notably node2vec and residual2vec) capture disciplinary structure consistently better than content-only text embeddings, with Sentence-BERT performing best among the text methods. The study demonstrates the value of citation-network context for science mapping and also shows that powerful language models can extract meaningful signal from limited text (titles and abstracts). The findings inform how researchers choose, combine, and evaluate embedding representations for knowledge-space analyses and science studies, especially when data coverage and access vary across domains.
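To make the radius-of-gyration idea concrete, here is a minimal sketch in plain Python. It assumes the common definition (root-mean-square distance of a group's vectors from their centroid) and treats string prefixes of a PACS code as hierarchy levels. Both are simplifying assumptions for illustration, and the function names, toy vectors, and toy codes are hypothetical rather than taken from the paper.

```python
import math
from collections import defaultdict


def radius_of_gyration(vectors):
    """Root-mean-square Euclidean distance of the vectors from their centroid."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dim)]
    mean_sq = sum(
        sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vectors
    ) / n
    return math.sqrt(mean_sq)


def gyration_by_prefix(embeddings, pacs_codes, prefix_len):
    """Group papers by the first `prefix_len` characters of their PACS code
    (an assumed stand-in for a hierarchy level) and measure each group's spread."""
    groups = defaultdict(list)
    for vec, code in zip(embeddings, pacs_codes):
        groups[code[:prefix_len]].append(vec)
    return {prefix: radius_of_gyration(vecs) for prefix, vecs in groups.items()}


# Toy data: four 2-D "paper embeddings" with made-up PACS-like codes.
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
pacs_codes = ["05.10", "05.20", "64.60", "64.70"]

overall = radius_of_gyration(embeddings)
per_group = gyration_by_prefix(embeddings, pacs_codes, prefix_len=2)
print(overall, per_group)
```

If an embedding reflects the classification hierarchy, papers sharing a finer category should be more tightly clustered than the corpus as a whole, so the per-group values would be expected to fall below the overall value, as they do in this toy case.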
Abstract
Recent advances in machine learning offer new ways to represent and study scholarly works and the space of knowledge. Graph and text embeddings provide a convenient vector representation of scholarly works based on citations and text. Yet, it is unclear whether their representations are consistent or provide different views of the structure of science. Here, we compare graph and text embeddings by testing their ability to capture the hierarchical structure of the Physics and Astronomy Classification Scheme (PACS) of papers published by the American Physical Society (APS). We also provide a qualitative comparison of the overall structure of the graph and text embeddings for reference. We find that neural network-based methods outperform traditional methods, and that graph embedding methods such as node2vec capture the PACS structure better than the other methods. Our results call for further investigations into how different contexts of scientific papers are captured by different methods, and how we can combine and leverage such information in an interpretable manner.
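One way to test whether an embedding captures PACS categories is the KNN classification mentioned in the TL;DR. The sketch below is a hedged illustration, not the authors' protocol: it assumes cosine similarity, a leave-one-out evaluation, and majority voting, and the function names, the value of k, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_loo_accuracy(embeddings, labels, k=3):
    """Leave-one-out accuracy: predict each paper's label by majority vote
    among its k most similar other papers."""
    correct = 0
    for i, query in enumerate(embeddings):
        neighbours = [
            (cosine(query, other), labels[j])
            for j, other in enumerate(embeddings)
            if j != i
        ]
        neighbours.sort(key=lambda pair: pair[0], reverse=True)
        votes = Counter(label for _, label in neighbours[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(embeddings)


# Toy data: two well-separated groups with made-up category labels.
toy_embeddings = [
    [1.0, 0.1], [1.0, 0.2], [0.9, 0.0],
    [0.1, 1.0], [0.2, 1.0], [0.0, 0.9],
]
toy_labels = ["05", "05", "05", "64", "64", "64"]

accuracy = knn_loo_accuracy(toy_embeddings, toy_labels, k=2)
print(accuracy)
```

Running the same evaluation on a graph embedding and a text embedding of the same papers, with labels taken at different levels of the classification, would give a like-for-like comparison of how well each representation separates the categories.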
