CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification
Lele Cao, Vilhelm von Ehrenheim, Mark Granroth-Wilding, Richard Anselmo Stahl, Andrew McCornack, Armin Catovic, Dhiana Deva Cavalcanti Rocha
TL;DR
CompanyKG is a real-world, large-scale heterogeneous knowledge graph for company similarity quantification, with 1.17M nodes and 51.06M weighted edges across 15 relation types. It defines three evaluation tasks (Similarity Prediction, Competitor Retrieval, Similarity Ranking) and benchmarks 11 baselines grouped into node-only, edge-only, and node+edge categories, including self-supervised graph methods such as GraphMAE and eGraphMAE. The findings indicate that node-based signals excel in similarity prediction, edge-based signals dominate competitor retrieval, and combining node and edge information yields competitive results, although no single method dominates all tasks. By releasing CompanyKG and its evaluation suite, the work offers a practical benchmark for graph learning in investment contexts and points to future directions such as temporal knowledge graphs that capture how companies evolve over time.
Abstract
In the investment industry, it is often essential to carry out fine-grained company similarity quantification for a range of purposes, including market mapping, competitor analysis, and mergers and acquisitions. We propose and publish a knowledge graph, named CompanyKG, to represent and learn diverse company features and relations. Specifically, 1.17 million companies are represented as nodes enriched with company description embeddings, and 15 different inter-company relations result in 51.06 million weighted edges. To enable a comprehensive assessment of methods for company similarity quantification, we have devised and compiled three evaluation tasks with annotated test sets: similarity prediction, competitor retrieval, and similarity ranking. We present extensive benchmarking results for 11 reproducible predictive methods categorized into three groups: node-only, edge-only, and node+edge. To the best of our knowledge, CompanyKG is the first large-scale heterogeneous graph dataset originating from a real-world investment platform, tailored for quantifying inter-company similarity.
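
As a rough illustration of the structure described in the abstract, the sketch below builds a toy graph in which each company is a node carrying a description embedding and each pair of companies may carry one weight per relation type. It then scores pairs with a simple node-only baseline (cosine similarity of embeddings). Everything here is an assumption made for illustration: the class `ToyCompanyGraph`, the helper names, the undirected edge handling, and the three-dimensional embeddings are hypothetical and are not the released CompanyKG data format or API.

```python
import math
from dataclasses import dataclass, field

NUM_EDGE_TYPES = 15  # the abstract mentions 15 inter-company relation types


@dataclass
class ToyCompanyGraph:
    """Hypothetical toy container; not the actual CompanyKG file layout."""
    # node id -> description embedding (plain list of floats)
    node_features: dict = field(default_factory=dict)
    # (smaller id, larger id) -> list with one weight slot per relation type
    edge_weights: dict = field(default_factory=dict)

    def add_company(self, node_id, embedding):
        self.node_features[node_id] = list(embedding)

    def add_relation(self, src, dst, edge_type, weight):
        if not 0 <= edge_type < NUM_EDGE_TYPES:
            raise ValueError("edge_type out of range")
        # Assumption: edges are treated as undirected in this sketch.
        key = (min(src, dst), max(src, dst))
        weights = self.edge_weights.setdefault(key, [0.0] * NUM_EDGE_TYPES)
        weights[edge_type] = weight


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def node_only_score(graph, a, b):
    """Node-only baseline: compare description embeddings, ignore edges."""
    return cosine_similarity(graph.node_features[a], graph.node_features[b])


def rank_candidates(graph, query, candidates):
    """Order candidate companies from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda c: node_only_score(graph, query, c),
        reverse=True,
    )


graph = ToyCompanyGraph()
graph.add_company(0, [1.0, 0.0, 0.0])
graph.add_company(1, [0.9, 0.1, 0.0])
graph.add_company(2, [0.0, 0.0, 1.0])
graph.add_relation(0, 1, edge_type=3, weight=2.0)

ranking = rank_candidates(graph, 0, [2, 1])
print(ranking)
```

An edge-only or node+edge method would additionally use the per-relation weights stored in `edge_weights`; the sketch leaves them unused on purpose so that the node-only case stays easy to follow.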
