Shapley-Guided Utility Learning for Effective Graph Inference Data Valuation

Hongliang Chi, Qiong Wu, Zhengyi Zhou, Yao Ma

TL;DR

This work formulates graph inference data valuation as a test-time problem where ground-truth labels are unavailable. It introduces SGUL, which combines transferable data- and model-specific features with a Shapley-guided optimization to directly predict Shapley values for test-time neighbors, enabling efficient valuation without labels. The method rests on a Structure-Aware Shapley formulation and a Shapley Value Decomposition for linear utilities, linking learned weights to feature Shapley values and resulting in a sparse, interpretable model. Empirical results on seven real-world datasets and a large-scale ogbn-arxiv study show SGUL outperforms baselines in both inductive and transductive settings, with favorable efficiency and robustness across graph structures. The approach offers a practical, scalable pathway to identify influential test-time neighbors for graph inference tasks with real-world applicability in dynamic graphs and real-time decision-making.

Abstract

Graph Neural Networks (GNNs) have demonstrated remarkable performance in various graph-based machine learning tasks, yet evaluating the importance of the neighbors of testing nodes remains largely unexplored because data importance is hard to assess without test labels. To address this gap, we propose Shapley-Guided Utility Learning (SGUL), a novel framework for graph inference data valuation. SGUL combines transferable data-specific and model-specific features to approximate test accuracy without relying on ground-truth labels. By incorporating Shapley values as a preprocessing step and using feature Shapley values as input, our method enables direct optimization of Shapley value prediction while reducing computational demands. SGUL overcomes key limitations of existing methods, including poor generalization to unseen test-time structures and indirect optimization. Experiments on diverse graph datasets demonstrate that SGUL consistently outperforms existing baselines in both inductive and transductive settings. SGUL offers an effective, efficient, and interpretable approach for quantifying the value of test-time neighbors.

Paper Structure

This paper contains 62 sections, 2 theorems, 13 equations, 4 figures, 10 tables, 4 algorithms.

Key Result

Theorem 1

Given a linear utility function $U(S) = \mathbf{w}^\top \mathbf{x}(S)$, where $\mathbf{w} \in \mathbb{R}^d$ is a parameter vector and $\mathbf{x}(S) \in \mathbb{R}^d$ is a feature vector representing subset $S$, the Shapley value of player $i$ with respect to $U$ can be expressed as a linear combination of feature Shapley values, $\phi_i(U) = \mathbf{w}^\top \boldsymbol{\psi}_i$, where $\boldsymbol{\psi}_i = [\phi_i(U_1), \phi_i(U_2), \ldots, \phi_i(U_d)]^\top$ is the vector of feature Shapley values of player $i$ and $U_k(S) = x_k(S)$ is the utility induced by the $k$-th feature.
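The decomposition of a linear utility follows from the linearity of the Shapley value, and it can be checked numerically on a small game. The sketch below is illustrative only and is not the authors' code: the players, the `strength` table, the two features returned by `x`, and the weights `w` are all made-up assumptions. It computes exact Shapley values by enumerating coalitions, which is feasible only for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley(players, utility):
    """Exact Shapley values by enumerating every coalition (small games only)."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


# Hypothetical toy game: four players with made-up "strength" scores.
players = [0, 1, 2, 3]
strength = {0: 0.9, 1: 0.4, 2: 0.7, 3: 0.1}


def x(S):
    """Two assumed features of a coalition: sqrt of its size and its max strength."""
    return [len(S) ** 0.5, max((strength[p] for p in S), default=0.0)]


w = [0.3, -1.2]  # assumed weights of the linear utility


def U(S):
    """Linear utility U(S) = w^T x(S)."""
    return sum(wk * xk for wk, xk in zip(w, x(S)))


# Shapley values of the full utility, computed directly.
phi_U = shapley(players, U)

# Feature Shapley values: one Shapley computation per feature utility U_k(S) = x_k(S).
psi = [shapley(players, lambda S, k=k: x(S)[k]) for k in range(len(w))]

# Recombine the feature Shapley values with the weights: w^T psi_i.
recombined = {i: sum(w[k] * psi[k][i] for k in range(len(w))) for i in players}

max_gap = max(abs(phi_U[i] - recombined[i]) for i in players)
print(f"largest gap between phi_i(U) and w^T psi_i: {max_gap:.2e}")
```

The practical point of this identity is that the feature Shapley values can be computed once per feature and reused for any weight vector, so changing `w` does not require re-running the Shapley computation on the full utility.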

Figures (4)

  • Figure 1: Accuracy curves for node dropping experiments using the SGC model on various datasets in the inductive setting. Our proposed SGUL method consistently maintains higher accuracy as nodes are removed, indicating its effectiveness in identifying important nodes. Note that GNNEvaluator is not shown for the larger datasets due to Out of Memory (OOM) errors.
  • Figure 2: Accuracy curves for node dropping experiments using the GCN model on various datasets in the inductive setting. Similar to the SGC results, our proposed SGUL method demonstrates superior performance in maintaining higher accuracy as nodes are removed. Note that GNNEvaluator is not shown for the larger datasets due to Out of Memory (OOM) errors.
  • Figure 3: Accuracy curves for node dropping experiments using the SGC (above) and GCN models (below) in the transductive setting.
  • Figure 4: Accuracy curves for node dropping experiments on the ogbn-arxiv dataset using the SGC model in the inductive setting. Our proposed SGUL method demonstrates superior performance, achieving a steeper decline in accuracy as high-value nodes are removed.
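The node dropping curves in these figures can be thought of as the result of ranking nodes by estimated value, removing them one at a time, and recording accuracy after each removal. The following is a minimal sketch of that loop under stated assumptions: `toy_accuracy` is a hypothetical additive stand-in for evaluating a GNN on the remaining neighbors, the `contribution` numbers are invented, and removing the highest-valued nodes first is an assumed protocol rather than a detail confirmed by the text above.

```python
def dropping_curve(nodes, ranking, utility):
    """Utility after removing the first 0, 1, 2, ... nodes of `ranking`."""
    remaining = set(nodes)
    curve = [utility(frozenset(remaining))]
    for node in ranking:
        remaining.discard(node)
        curve.append(utility(frozenset(remaining)))
    return curve


# Hypothetical per-node contributions to accuracy (made-up numbers).
contribution = {"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.10}


def toy_accuracy(S):
    """Assumed additive accuracy model: a base rate plus each kept node's share."""
    return 0.25 + sum(contribution[n] for n in S)


# Stand-in for the values a valuation method would assign to each node.
values = dict(contribution)

# Remove the highest-valued nodes first.
ranking = sorted(values, key=values.get, reverse=True)
curve = dropping_curve(list(contribution), ranking, toy_accuracy)

for step, acc in enumerate(curve):
    print(f"nodes removed: {step}, accuracy: {acc:.2f}")
```

In a real experiment the utility call would retrain or re-run inference on the reduced graph, and curves from different valuation methods would be compared on the same removal budget.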

Theorems & Definitions (4)

  • Definition 1: Graph Inference Data Valuation
  • Theorem 1: Shapley Value Decomposition
  • Theorem 2: Shapley Value Decomposition
  • proof