Color: A Framework for Applying Graph Coloring to Subgraph Cardinality Estimation

Kyle Deeds, Diandre Sabale, Moe Kayali, Dan Suciu

TL;DR

COLOR addresses subgraph cardinality estimation for graph workloads by building a compact lifted graph from a coloring of the data graph; the coloring is chosen to capture topology and to reduce non-uniformity and correlation. It defines lifted subgraph counting through $W(\pi, Q, G)$ and $\Phi(Q, G)$ for acyclic queries and extends to cyclic queries using cycle and path closure probabilities. Optimizations such as tree decompositions, partial aggregation, and sampling (a Horvitz-Thompson estimator with importance sampling) keep estimation tractable. The framework reports accuracy improvements of up to $10^3\times$ over baselines, summaries under 1 MB, fast inference, and graceful degradation under updates. Its contributions are a formal framework, multiple coloring strategies, provable guarantees for special cases, and an empirical evaluation across diverse datasets. Together, these make it a practical cardinality estimator for query optimization in graph DBMSs.

Abstract

Graph workloads pose a particularly challenging problem for query optimizers. They typically feature large queries made up entirely of many-to-many joins with complex correlations. This puts significant stress on traditional cardinality estimation methods, which generally see catastrophic errors when estimating the size of queries with only a handful of joins. To overcome this, we propose COLOR, a framework for subgraph cardinality estimation that applies insights from graph compression theory to produce a compact summary capturing the global topology of the data graph. Further, we identify several key optimizations that enable tractable estimation over this summary even for large query graphs. We then evaluate several designs within this framework and find that they improve accuracy by up to $10^3\times$ over all competing methods while maintaining fast inference, a small memory footprint, efficient construction, and graceful degradation under updates.

Paper Structure

This paper contains 43 sections, 3 theorems, 23 equations, 14 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{G}$ be a lifted graph defined by a stable coloring $\sigma$. Then $\tau_{\text{min}}=\tau_{\text{avg}}=\tau_{\text{max}}$, and, for any acyclic query $Q$, the lifted graph estimator is exact: its estimate equals the true cardinality of $Q$ on the data graph.
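
The intuition behind this exactness result can be checked on a toy graph. The sketch below is not the authors' implementation: it assumes a simplified setting (undirected, unlabeled graph; path-shaped queries counted as homomorphisms), obtains a stable coloring with standard color refinement, and uses per-color-pair average degrees as a stand-in for $\tau_{\text{avg}}$. All function names (`color_refinement`, `lift`, `estimate_paths`, `exact_paths`) are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def color_refinement(adj):
    """Refine a uniform coloring until it is stable (assumed technique: 1-WL)."""
    color = {v: 0 for v in adj}
    while True:
        # Signature: own color plus the sorted multiset of neighbor colors.
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v]))) for v in adj}
        ids = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_color = {v: ids[sig[v]] for v in adj}
        # Refinement only splits classes, so an equal class count means stable.
        if len(set(new_color.values())) == len(set(color.values())):
            return new_color
        color = new_color


def lift(adj, color):
    """Summary: color sizes and average number of d-colored neighbors per c-colored node."""
    size = defaultdict(int)
    edges = defaultdict(int)
    for v in adj:
        size[color[v]] += 1
        for u in adj[v]:
            edges[(color[v], color[u])] += 1
    avg_deg = {(c, d): Fraction(n, size[c]) for (c, d), n in edges.items()}
    return dict(size), avg_deg


def estimate_paths(size, avg_deg, k):
    """Estimate the count of a k-edge path query from the summary alone."""
    w = {c: Fraction(n) for c, n in size.items()}
    for _ in range(k):
        nxt = defaultdict(Fraction)
        for (c, d), deg in avg_deg.items():
            nxt[d] += w.get(c, Fraction(0)) * deg
        w = nxt
    return sum(w.values())


def exact_paths(adj, k):
    """Ground truth: number of length-k walks, computed on the data graph."""
    w = {v: 1 for v in adj}
    for _ in range(k):
        w = {v: sum(w[u] for u in adj[v]) for v in adj}
    return sum(w.values())


# Toy data graph: a star centered at node 0 with a tail 3-4.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}

stable = color_refinement(adj)
trivial = {v: 0 for v in adj}  # single color: a plain average-degree estimate

stable_estimate = estimate_paths(*lift(adj, stable), 2)
trivial_estimate = estimate_paths(*lift(adj, trivial), 2)
truth = exact_paths(adj, 2)
print(stable_estimate, trivial_estimate, truth)
```

On this graph the stable coloring reproduces the true count for the 2-edge path, while the single-color summary underestimates it. This illustrates why, under a stable coloring, every node of a color has the same number of neighbors in each other color, so the minimum, average, and maximum degree statistics coincide.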

Figures (14)

  • Figure 1: Lifted counting example. We estimate the number of occurrences of the query pattern $Q$ ($u \rightarrow v \leftarrow w$) in the data graph $G$. First, the data graph is partitioned offline, and the resulting summary is stored as the lifted graph $\mathcal{G}$. At runtime, the cardinality estimate is computed on the lifted graph $\mathcal{G}$ without reference to the underlying data graph.
  • Figure 2: Accuracy of coloring as the number of colors increases
  • Figure 3: Relative Error by Estimator
  • Figure 4: Inference Time by Estimator
  • Figure 5: Relative Error by Cardinality Bound Method
  • ...and 9 more figures

Theorems & Definitions (18)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Example 1
  • Definition 5
  • Definition 6
  • Example 2
  • Definition 7
  • Definition 8: Lifted Estimator for Acyclic Queries
  • ...and 8 more