Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings

Tim Kunt, Annika Buchholz, Imene Khebouri, Thorsten Koch, Ida Litzel, Thi Huong Vu

TL;DR

This work tackles large-scale scientometric mapping by integrating text semantics from LLM embeddings with the structural information of citation graphs. Using the Ollama framework, it embeds abstracts with two models ($1024$- and $768$-dimensional, respectively) and analyzes their geometry on an $n$-sphere, comparing semantic distances to graph-based shortest-path distances in Web of Science data. It finds a positive correlation between embedding and graph distances (PCC around $0.34$–$0.46$) and identifies 255 subject centers whose embeddings form clear, cloud-like distributions across natural, social, and humanities sciences, supporting soft, multi-label classifications. The results motivate a hybrid embedding–graph distance metric to improve topic classification and clustering at scale, with practical implications for robust, scalable knowledge organization in large bibliographic corpora.

Abstract

Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, which contains ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
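
The comparison described in the summary above (distances between embeddings on a sphere versus shortest-path distances on the citation graph, summarized by a Pearson correlation coefficient) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four toy records, their 2-dimensional vectors, the undirected treatment of citations, and the choice of angular distance are not taken from the paper, whose embeddings are 1024- and 768-dimensional and whose exact distance definitions may differ.

```python
import math
from collections import deque
from itertools import combinations


def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_distance(u, v):
    """Angle in radians between two vectors after normalization."""
    dot = sum(a * b for a, b in zip(normalize(u), normalize(v)))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error


def shortest_path_length(adj, src, dst):
    """Breadth-first search hop count on an unweighted graph; None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical toy data: four records with made-up 2-d "embeddings"
# of different lengths, which normalization maps onto the unit circle.
embeddings = {
    "a": [2.0 * math.cos(0.0), 2.0 * math.sin(0.0)],
    "b": [0.5 * math.cos(0.3), 0.5 * math.sin(0.3)],
    "c": [3.0 * math.cos(0.7), 3.0 * math.sin(0.7)],
    "d": [1.5 * math.cos(1.2), 1.5 * math.sin(1.2)],
}

# Hypothetical citation links a-b, b-c, c-d, treated as undirected (an assumption).
citations = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

embedding_distances, graph_distances = [], []
for r1, r2 in combinations(sorted(embeddings), 2):
    hops = shortest_path_length(citations, r1, r2)
    if hops is None:  # skip pairs with no connecting path
        continue
    embedding_distances.append(angular_distance(embeddings[r1], embeddings[r2]))
    graph_distances.append(hops)

pcc = pearson(embedding_distances, graph_distances)
print(f"PCC between embedding and graph distances: {pcc:.2f}")
```

In this toy setup the records were placed so that citation neighbors are also close in angle, so the correlation comes out strongly positive by construction. On real data the relationship is much noisier, which is consistent with the moderate PCC range reported in the summary.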

Paper Structure

This paper contains 9 sections, 3 figures.

Figures (3)

  • Figure 1: PCA variance explained as a function of number of dimensions
  • Figure 2: Comparing pairwise distances of records in the embedding space and on the citation graph
  • Figure 3: Visualization of the embedded abstracts using PCA - A Map of Sciences