Relation-Stratified Sampling for Shapley Values Estimation in Relational Databases

Amirhossein Alizad, Mostafa Milani

TL;DR

This work addresses the computational challenge of attributing query results to individual relational tuples by introducing Relation-Stratified Sampling (RSS) and its adaptive variant, ARSS. By stratifying coalitions according to relation-level counts rather than coalition size alone, RSS exploits join structure to reduce variance and wasted samples, while ARSS dynamically reallocates effort to high-variance strata to accelerate convergence. In experiments on TPC-H with multi-relation joins, RSS and ARSS consistently outperform classical Monte Carlo and size-based stratified sampling, achieving lower error with fewer samples and offering an anytime estimator suitable for interactive explanations. The approach is complemented by practical optimizations such as view-based compilation and caching, making Shapley-based attribution feasible for realistic relational workloads.

Abstract

Shapley-like values, including the Shapley and Banzhaf values, provide a principled way to quantify how individual tuples contribute to a query result. Their exact computation, however, is intractable because it requires aggregating marginal contributions over exponentially many permutations or subsets. While sampling-based estimators have been studied in cooperative game theory, their direct use for relational query answering remains underexplored and often ignores the structure of schemas and joins. We study tuple-level attribution for relational queries through sampling and introduce Relation-Stratified Sampling (RSS). Instead of stratifying coalitions only by size, RSS partitions the sample space by a relation-wise count vector that records how many tuples are drawn from each relation. This join-aware stratification concentrates samples on structurally valid and informative coalitions and avoids strata that cannot satisfy query conditions. We further develop an adaptive variant, ARSS, that reallocates budget across strata using variance estimates obtained during sampling, improving estimator efficiency without increasing the total number of samples. We analyze these estimators, describe a practical implementation that reuses compiled views to reduce per-sample query cost, and evaluate them on TPC-H workloads. Across diverse queries with multi-relation joins and aggregates, RSS and ARSS consistently outperform classical Monte Carlo sampling (MCS) and size-based Stratified Sampling (SS), yielding lower error and variance with fewer samples. An ablation shows that relation-aware stratification and adaptive allocation contribute complementary gains, making ARSS a simple, effective, and anytime estimator for database-centric Shapley attribution.
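
To make the idea of stratifying by a relation-wise count vector concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy two-relation query, COUNT(*) over R JOIN S on an equality key, and estimates the Shapley value of one tuple of R. The function names (`join_count`, `rss_shapley`), the fixed per-stratum budget, and the stratum weights are illustrative assumptions: the weights are derived here from the standard size-stratified form of the Shapley value, in which a uniformly drawn coalition of a given size has a multivariate hypergeometric count vector, and they are not taken from the paper. The rule that skips strata with no S tuples is specific to this toy join.

```python
import random
from itertools import product
from math import comb


def join_count(r_keys, s_keys):
    """Toy query value: COUNT(*) of R JOIN S on an equality key."""
    return sum(1 for a in r_keys for b in s_keys if a == b)


def rss_shapley(r_keys, s_keys, target, samples_per_stratum=200, seed=0):
    """Estimate the Shapley value of tuple r_keys[target] for join_count.

    Strata are count vectors (k_r, k_s): how many of the other tuples of R
    and how many tuples of S the sampled coalition contains.
    Returns (estimate, sum of stratum weights).
    """
    rng = random.Random(seed)
    others_r = [k for i, k in enumerate(r_keys) if i != target]
    n = len(r_keys) + len(s_keys)  # total number of players (tuples)
    estimate, total_weight = 0.0, 0.0

    for k_r, k_s in product(range(len(others_r) + 1), range(len(s_keys) + 1)):
        # Assumed weight: (1/n) * P(count vector | coalition size), where the
        # second factor is a multivariate hypergeometric probability.
        weight = (comb(len(others_r), k_r) * comb(len(s_keys), k_s)
                  / (n * comb(n - 1, k_r + k_s)))
        total_weight += weight

        # Toy-specific pruning: with no S tuples the join is empty, so the
        # marginal contribution of an R tuple is exactly zero.
        if k_s == 0:
            continue

        acc = 0.0
        for _ in range(samples_per_stratum):
            coalition_r = rng.sample(others_r, k_r)
            coalition_s = rng.sample(s_keys, k_s)
            with_target = join_count(coalition_r + [r_keys[target]], coalition_s)
            without_target = join_count(coalition_r, coalition_s)
            acc += with_target - without_target
        estimate += weight * acc / samples_per_stratum

    return estimate, total_weight


if __name__ == "__main__":
    est, w = rss_shapley(r_keys=[1, 1, 2], s_keys=[1, 1, 2, 3], target=0)
    print(f"estimate={est:.3f}, weight sum={w:.6f}")
```

In this toy instance the target tuple joins with exactly two tuples of S, and each joining pair splits its unit of output evenly between its two tuples, so the exact Shapley value is 1.0 and the estimate should land close to it. An adaptive variant in the spirit of ARSS would replace the fixed `samples_per_stratum` with a budget that shifts toward strata whose observed marginal contributions vary most; that step is omitted from this sketch.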

Paper Structure

This paper contains 22 sections, 12 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Convergence comparisons for 10L1.
  • Figure 2: Convergence in high-cost setting 21H3.