Making the complete OpenAIRE citation graph easily accessible through compact data representation
Joakim Skarding, Pavel Sanda
TL;DR
The paper tackles the challenge of accessibility for the OpenAIRE citation graph by delivering a downscaled, yet structurally faithful, 32GB representation in CSV format. It details a memory-efficient processing pipeline that translates OpenAIRE IDs to compact int32 node IDs, retains only citation edges, and updates all relations accordingly, enabling processing on conventional hardware. The work provides clear dataset metadata, open repositories for data and code, and demonstrates substantial practical impact by making large-scale citation networks usable for humanities and scientific computing applications. This approach lowers barriers for reproducible analysis, temporal modeling, and graph-based learning using OpenAIRE data.
Abstract
The OpenAIRE graph contains a large citation graph dataset, with over 200 million publications and over 2 billion citations. The current graph is available as a dump with metadata which, uncompressed, totals ~TB, making it hard to process on conventional computers. To make this network more accessible to the community, we provide a processed OpenAIRE graph that is downscaled to 32GB while preserving the full graph structure. In addition, we offer the processed data in a very simple format that allows straightforward further manipulation. We also provide a Python pipeline that can be used to process future releases of the OpenAIRE graph.
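The core idea of the downscaling (replacing long string identifiers with compact integer node IDs and keeping only citation edges) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `"Cites"` relation label, the example identifiers, and the `source,target` CSV header are all assumptions made for illustration.

```python
import csv
import io

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit node ID can hold


def compact_citation_edges(relations, cites_label="Cites"):
    """Map string IDs to consecutive integers and keep only citation edges.

    `relations` is an iterable of (source_id, relation_type, target_id)
    tuples; `cites_label` is an assumed name for the citation relation.
    """
    id_map = {}  # original string ID -> compact integer ID
    edges = []   # (source, target) pairs expressed in compact IDs
    for source, rel_type, target in relations:
        if rel_type != cites_label:
            continue  # drop every relation that is not a citation
        # setdefault assigns the next free integer on first sight of an ID
        src = id_map.setdefault(source, len(id_map))
        dst = id_map.setdefault(target, len(id_map))
        edges.append((src, dst))
    if len(id_map) - 1 > INT32_MAX:
        raise OverflowError("node count does not fit into int32 IDs")
    return id_map, edges


def edges_to_csv(edges):
    """Serialize the compact edge list as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source", "target"])  # hypothetical column names
    writer.writerows(edges)
    return buf.getvalue()


# Hypothetical input records, invented for this example.
relations = [
    ("pub::a", "Cites", "pub::b"),
    ("pub::a", "hasAuthorInstitution", "org::x"),
    ("pub::c", "Cites", "pub::a"),
]
id_map, edges = compact_citation_edges(relations)
csv_text = edges_to_csv(edges)
```

Over 200 million publications fit comfortably within the signed 32-bit range (about 2.1 billion values), which is why int32 IDs suffice here. A lookup table such as `id_map` would need to be kept alongside the edge list whenever results have to be mapped back to the original OpenAIRE identifiers.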
