Representing the Disciplinary Structure of Physics: A Comparative Evaluation of Graph and Text Embedding Methods

Isabel Constantino, Sadamori Kojaku, Santo Fortunato, Yong-Yeol Ahn

TL;DR

This work directly compares graph-based and text-based embeddings for their ability to recover the hierarchical PACS structure in APS physics papers, using a large, real-world APS corpus. By conducting KNN classification, hierarchy-aware radius-of-gyration analyses, pairwise distance comparisons, and link-prediction tasks, the authors reveal that graph embeddings (notably node2vec and residual2vec) consistently better capture disciplinary structure than content-only text embeddings, while Sentence-BERT provides the strongest performance among text methods. The study demonstrates the value of citation-network context for science mapping, while also showing that powerful language models can extract meaningful signals from limited text (titles/abstracts). The findings inform how researchers should choose, combine, and evaluate embedding representations for knowledge-space analyses and science studies, especially when data coverage and access vary across domains.
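
The KNN classification step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the two-dimensional synthetic vectors stand in for paper embeddings, each item is assumed to carry exactly one label (real papers may carry several PACS codes), cosine similarity is an assumed choice of metric, and the helper names `knn_predict` and `micro_f1_single_label` are hypothetical.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k):
    """Label each test vector by majority vote among its k nearest
    training vectors, ranked by cosine similarity (an assumed metric)."""
    a = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    b = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = b @ a.T                               # (n_test, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar
    preds = []
    for row in nearest:
        labels, counts = np.unique(train_y[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)


def micro_f1_single_label(y_true, y_pred):
    """With exactly one label per item, micro-averaged F1 equals accuracy."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Synthetic stand-in for embeddings: two well-separated "disciplines".
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
train_y = np.repeat([0, 1], 40)
train_x = centers[train_y] + rng.normal(scale=0.5, size=(80, 2))
test_y = np.repeat([0, 1], 10)
test_x = centers[test_y] + rng.normal(scale=0.5, size=(20, 2))

for k in (2, 4, 8):
    score = micro_f1_single_label(test_y, knn_predict(train_x, train_y, test_x, k))
    print(f"k={k}: micro-F1={score:.2f}")
```

Running the same loop once per embedding method, with the labels held fixed, gives a per-method score at each $k$ that can be compared directly.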

Abstract

Recent advances in machine learning offer new ways to represent and study scholarly works and the space of knowledge. Graph and text embeddings provide a convenient vector representation of scholarly works based on citations and text. Yet, it is unclear whether their representations are consistent or provide different views of the structure of science. Here, we compare graph and text embedding by testing their ability to capture the hierarchical structure of the Physics and Astronomy Classification Scheme (PACS) of papers published by the American Physical Society (APS). We also provide a qualitative comparison of the overall structure of the graph and text embeddings for reference. We find that neural network-based methods outperform traditional methods and graph embedding methods such as node2vec are better than other methods at capturing the PACS structure. Our results call for further investigations into how different contexts of scientific papers are captured by different methods, and how we can combine and leverage such information in an interpretable manner.

Paper Structure (16 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Embedding and evaluation framework for APS papers.
  • Figure 2: UMAP projections on a sample of the embeddings show that Sentence-BERT, node2vec, and residual2vec all follow the general clustering structure of physics research.
  • Figure 3: Classification of PACS by a k-nearest neighbor algorithm (left) with $k = 2$ to 128; and (right; inset) with micro-F1 score $> 0.68$ and $k \in \{ 2,4,8 \}$. The graph embedding methods outperform all text embedding methods. Among the text embeddings, Sentence-BERT performs best, though not as well as the graph embeddings. With doc2vec, the abstract embedding results in improved classification performance compared to the title embedding.
  • Figure 4: Box plots of the distributions of PACS code radius of gyration (ROG) show that a deeper PACS level (left to right) yields a lower or more left-skewed ROG distribution only in the Sentence-BERT, Laplacian Eigenmap, node2vec, and residual2vec embeddings.
  • Figure 5: A comparison of embedding distance distributions between sampled paper pairs also shows that Sentence-BERT (title and abstract embeddings), node2vec, and residual2vec embeddings are more likely to embed papers of the same discipline closer to one another than random pairs of papers.
  • ...and 2 more figures
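
The radius of gyration (ROG) referred to in Figure 4 can be sketched as follows, using the common definition: the root-mean-square distance of a group's points from their centroid. This is a hedged illustration, not the authors' code. Euclidean distance is an assumption (the paper may use a different metric), and the toy codes, coordinates, and helper names are made up for the example.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square Euclidean distance of points from their centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    mean_sq = sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
    ) / n
    return math.sqrt(mean_sq)


def rog_for_prefix(papers, prefix):
    """ROG of all papers whose (hypothetical) code starts with `prefix`."""
    group = [vec for code, vec in papers if code.startswith(prefix)]
    return radius_of_gyration(group)


# Toy data: (code, 2-D embedding). Both codes and coordinates are invented.
papers = [
    ("05.45", (0.0, 0.0)),
    ("05.45", (0.0, 2.0)),
    ("05.70", (4.0, 0.0)),
    ("05.70", (4.0, 2.0)),
]

shallow = rog_for_prefix(papers, "05")    # broader category
deep = rog_for_prefix(papers, "05.45")    # more specific category
print(f"ROG at level '05': {shallow:.3f}, at level '05.45': {deep:.3f}")
```

In an embedding that respects the hierarchy, a more specific code should group papers more tightly, so its ROG is smaller than that of its parent category, as in this toy case.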