Aster: Enhancing LSM-structures for Scalable Graph Database

Dingheng Mo, Junfeng Liu, Fan Wang, Siqiang Luo

TL;DR

This work tackles the challenge of efficiently storing and querying large, evolving graphs on disk. It introduces Poly-LSM, a graph-oriented LSM-tree that blends vertex-based and edge-based layouts through multiple entry types (pivot and delta) and an adaptive update mechanism guided by a derived I/O cost model. Poly-LSM also uses partitioned Elias-Fano encoding for space efficiency and an in-memory degree sketch to support degree-based decisions. Building on Poly-LSM, the authors implement Aster, a graph database with MVCC support and Gremlin/TinkerPop integration, and evaluate it on real-world and property-graph workloads. Aster outperforms mainstream baselines on large-scale graphs (up to 17x throughput improvement over the best-performing baseline on the billion-scale Twitter graph) and, owing to its adaptive updates, remains more stable under workload shifts. Overall, the results indicate that a graph-oriented, adaptive LSM storage engine can substantially improve update and lookup efficiency for disk-resident graphs.
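
To give a feel for the encoding mentioned above, here is a minimal sketch of plain (non-partitioned) Elias-Fano applied to a sorted neighbor list. It is not the authors' implementation: the partitioned variant additionally splits the list into chunks and encodes each chunk separately, which is omitted here. The function names `ef_encode` and `ef_decode`, and the use of Python lists as bit vectors, are assumptions made for illustration.

```python
from typing import List, Tuple


def ef_encode(values: List[int], universe: int) -> Tuple[int, List[int], List[int]]:
    """Encode a sorted list of ints in [0, universe) with plain Elias-Fano.

    Returns (low_bits, lows, highs). `highs` is a bit vector stored as a
    list of 0/1 for readability; a real encoder would pack the bits.
    """
    n = len(values)
    if n == 0:
        return 0, [], []
    # Low bits per element: floor(log2(universe / n)), never negative.
    low_bits = max(0, (universe // n).bit_length() - 1)
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # High parts: element i sets the bit at position (high_i + i).
    highs = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        highs[(v >> low_bits) + i] = 1
    return low_bits, lows, highs


def ef_decode(low_bits: int, lows: List[int], highs: List[int]) -> List[int]:
    """Recover the original sorted list from its Elias-Fano parts."""
    out = []
    i = 0  # index of the next element to decode
    for pos, bit in enumerate(highs):
        if bit:
            high = pos - i  # undo the +i offset added during encoding
            out.append((high << low_bits) | lows[i])
            i += 1
    return out


# Usage: round-trip a small, hypothetical neighbor list.
neighbors = [3, 4, 7, 13, 14, 15, 21, 43]
low_bits, lows, highs = ef_encode(neighbors, universe=64)
assert ef_decode(low_bits, lows, highs) == neighbors
```

The encoding costs roughly `low_bits` bits per element plus a high-bits vector of about two bits per element, which is why it suits long, sorted adjacency lists.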

Abstract

There is a proliferation of applications requiring the management of large-scale, evolving graphs under workloads with intensive graph updates and lookups. Driven by this challenge, we introduce Poly-LSM, a high-performance key-value storage engine for graphs with the following novel techniques: (1) Poly-LSM is embedded with a new design of graph-oriented LSM-tree structure that features a hybrid storage model for concisely and effectively storing graph data. (2) Poly-LSM utilizes an adaptive mechanism to handle edge insertions and deletions on graphs with optimized I/O efficiency. (3) Poly-LSM exploits the skewness of graph data to encode the key-value entries. Building upon this foundation, we further implement Aster, a robust and versatile graph database that supports the Gremlin query language, facilitating various graph applications. In our experiments, we compared Aster against several mainstream real-world graph databases. The results demonstrate that Aster outperforms all baseline graph databases, especially on large-scale graphs. Notably, on the billion-scale Twitter graph dataset, Aster achieves up to 17x throughput improvement compared to the best-performing baseline graph system.
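
The difference between the two update styles in the hybrid storage model can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not Poly-LSM itself: it assumes a delta update is a blind append of a single edge change, while a pivot update reads the vertex's current adjacency list and rewrites it as a whole. The class `ToyHybridStore` and all of its members are hypothetical, and the paper's I/O cost model for choosing between the two paths is not reproduced.

```python
from typing import Dict, List, Set, Tuple


class ToyHybridStore:
    """Toy stand-in for a key-value store holding pivot and delta entries."""

    def __init__(self) -> None:
        # Pivot entries: vertex -> full neighbor set (vertex-based layout).
        self.pivots: Dict[int, Set[int]] = {}
        # Delta entries: (src, dst, is_insert) in arrival order (edge-based layout).
        self.deltas: List[Tuple[int, int, bool]] = []
        # Number of simulated read-before-write operations.
        self.reads = 0

    def add_vertex(self, v: int) -> None:
        self.pivots.setdefault(v, set())

    def delta_update(self, src: int, dst: int, insert: bool = True) -> None:
        # Blind write: no lookup of the existing adjacency list is needed.
        self.deltas.append((src, dst, insert))

    def pivot_update(self, src: int, dst: int, insert: bool = True) -> None:
        # Read-modify-write: fetch the merged list, change it, store it back.
        self.reads += 1
        merged = self.neighbors(src)
        if insert:
            merged.add(dst)
        else:
            merged.discard(dst)
        self.pivots[src] = merged
        # Pending deltas for src are now folded into the new pivot entry.
        self.deltas = [d for d in self.deltas if d[0] != src]

    def neighbors(self, v: int) -> Set[int]:
        # A lookup merges the pivot entry with any later delta entries.
        result = set(self.pivots.get(v, set()))
        for src, dst, insert in self.deltas:
            if src == v:
                if insert:
                    result.add(dst)
                else:
                    result.discard(dst)
        return result


# Usage: cheap delta writes first, then one pivot write that consolidates them.
store = ToyHybridStore()
store.add_vertex(1)
store.delta_update(1, 2)
store.delta_update(1, 3)
store.pivot_update(1, 4)
assert store.neighbors(1) == {2, 3, 4}
```

In this toy, delta updates keep writes cheap but make lookups scan pending deltas, whereas pivot updates pay a read up front and leave a single consolidated entry to look up.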

Paper Structure

This paper contains 14 sections, 2 theorems, 12 equations, 9 figures, 6 tables, and 1 algorithm.

Key Result

Lemma 1

When the workload is uniformly distributed, the total cost of Poly-LSM is $O(1)$-competitive. When the workload is skewed, Poly-LSM is $O(\log m)$-competitive.
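
For readers unfamiliar with the term, the standard notion of competitiveness from online-algorithm analysis is sketched below. The symbols $\mathrm{ALG}$, $\mathrm{OPT}$, $\sigma$, $c$, and $\alpha$ are generic notation assumed here and are not taken from the paper; the paper's exact cost definition and the meaning of $m$ are not given in this summary.

$$
\mathrm{cost}_{\mathrm{ALG}}(\sigma) \;\le\; c \cdot \mathrm{cost}_{\mathrm{OPT}}(\sigma) + \alpha \qquad \text{for every workload } \sigma
$$

Under this reading, $O(1)$-competitive means the total cost stays within a constant factor $c$ of an optimal strategy, while $O(\log m)$-competitive means the factor $c$ may grow at most logarithmically in $m$.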

Figures (9)

  • Figure 1: The left figure illustrates the sub-optimal trade-off between lookup and update in existing LSM-based structures. The right figure demonstrates our insight that the relative advantage of vertex-based versus edge-based updates depends on the lookup ratio and the vertex degree.
  • Figure 2: Diagram of the LSM-tree structure.
  • Figure 3: An overview of the Poly-LSM demonstrating the workflow of several basic operations including add new vertex, delta edge update, and pivot edge update. In this figure, Buf represents the memory buffer of the LSM-tree.
  • Figure 4: The degree sketch allows Poly-LSM to index the degree of each vertex in memory with very few bits.
  • Figure 5: The architecture of Aster.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2