Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings
Tim Kunt, Annika Buchholz, Imene Khebouri, Thorsten Koch, Ida Litzel, Thi Huong Vu
TL;DR
This work tackles large-scale scientometric mapping by integrating text semantics from LLM embeddings with the structural information of citation graphs. Using the Ollama framework, it embeds abstracts with two models (one $1024$-dimensional, one $768$-dimensional) and analyzes their geometry on an $n$-sphere, comparing semantic distances to graph-based shortest-path distances in Web of Science data. It finds a positive correlation between embedding and graph distances (Pearson correlation coefficient, PCC, of roughly $0.34$–$0.46$) and identifies 255 subject centers whose embeddings form clear, cloud-like distributions across the natural sciences, social sciences, and humanities, supporting soft, multi-label classifications. The results motivate a hybrid embedding–graph distance metric to improve topic classification and clustering at scale, with practical implications for robust, scalable knowledge organization in large bibliographic corpora.
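The comparison between embedding distances and graph distances can be illustrated with a minimal sketch. It is not the authors' pipeline: the four toy "papers", their 3-dimensional vectors, and the citation chain are invented for illustration, the citation graph is assumed to be undirected, and the helper names (`normalize`, `angular_distance`, `shortest_path_lengths`, `pearson`) are hypothetical. In practice the vectors would come from an embedding model and have $1024$ or $768$ dimensions.

```python
import math
from collections import deque
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length so that it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_distance(u, v):
    """Great-circle distance (in radians) between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop counts from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data (assumption): four papers with made-up 3-d embeddings.
embeddings = {
    "A": normalize([1.0, 0.0, 0.0]),
    "B": normalize([0.9, 0.4, 0.0]),
    "C": normalize([0.3, 0.9, 0.2]),
    "D": normalize([0.0, 0.6, 0.8]),
}
# Toy citation chain A - B - C - D, treated as undirected (assumption).
citations = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

pairs = list(combinations(sorted(embeddings), 2))
emb_d = [angular_distance(embeddings[a], embeddings[b]) for a, b in pairs]
graph_d = [shortest_path_lengths(citations, a)[b] for a, b in pairs]

pcc = pearson(emb_d, graph_d)
print(f"PCC between embedding and graph distances: {pcc:.2f}")
```

On real data the graph is far too large to evaluate all pairs, so a sketch like this would be applied to a sample of publication pairs rather than to the full corpus.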
Abstract
Large text data sets, such as collections of publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
