Graph Data Selection for Domain Adaptation: A Model-Free Approach

Ting-Wei Li, Ruizhong Qiu, Hanghang Tong

TL;DR

GRADATE is a model-free data selection framework for graph domain adaptation (GDA). It introduces the Graph Dataset Distance (GDD), which is built on the Fused Gromov-Wasserstein (FGW) distance to compare graphs while optionally incorporating labels. The GREAT algorithm optimizes a sparse weighting over training graphs to minimize the GDD between the weighted source set and the validation set, yielding a high-quality training subset that can be used on its own or fed to downstream GDA methods. Theoretical results bound the target risk by the source risk plus a GDD term, which justifies the data-selection objective. Experiments across six real-world graph datasets show that GRADATE often outperforms model-centric baselines and consistently boosts off-the-shelf GNNs and GDA methods with far less data. Overall, GRADATE offers a scalable, data-efficient approach that complements existing GDA strategies by prioritizing training data quality under distribution shift.
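To make the FGW building block concrete, here is a minimal sketch that evaluates the fused Gromov-Wasserstein objective with squared loss for a fixed coupling between the nodes of two small graphs. It is an illustration under stated assumptions, not the paper's implementation: the function name `fgw_cost`, the toy graphs, and the convention that `alpha` weights the structure term are all choices made here. It only scores a given coupling; the actual FGW distance is the minimum of this quantity over all valid couplings, which an optimal transport solver would compute.

```python
from itertools import product


def fgw_cost(M, C1, C2, pi, alpha=0.5):
    """Evaluate the FGW objective (squared loss) for a FIXED coupling pi.

    M[i][j]  : feature cost between node i of graph 1 and node j of graph 2
    C1, C2   : intra-graph structure matrices (adjacency matrices here)
    pi[i][j] : transport mass sent from node i to node j
    alpha    : assumed convention, 0 = features only, 1 = structure only
    """
    n, m = len(C1), len(C2)
    feature_term = sum(M[i][j] * pi[i][j] for i in range(n) for j in range(m))
    structure_term = sum(
        (C1[i][k] - C2[j][l]) ** 2 * pi[i][j] * pi[k][l]
        for i, j, k, l in product(range(n), range(m), range(n), range(m))
    )
    return (1 - alpha) * feature_term + alpha * structure_term


# Toy example (hypothetical data): two copies of a 3-node path graph
# with scalar node features 0, 1, 2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [0.0, 1.0, 2.0]
M = [[(xi - xj) ** 2 for xj in x] for xi in x]

# Coupling that matches each node to its own copy (mass 1/3 each).
identity = [[1 / 3 if i == j else 0.0 for j in range(3)] for i in range(3)]
# Independent coupling: every node spreads its mass uniformly.
uniform = [[1 / 9 for _ in range(3)] for _ in range(3)]

identity_cost = fgw_cost(M, A, A, identity)
product_cost = fgw_cost(M, A, A, uniform)
```

For identical graphs the node-to-node matching incurs zero cost, while the uninformative uniform coupling does not, which is why the minimization over couplings matters.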

Abstract

Graph domain adaptation (GDA) is a fundamental task in graph machine learning, with techniques like shift-robust graph neural networks (GNNs) and specialized training procedures to tackle the distribution shift problem. Although these model-centric approaches show promising results, they often struggle with severe shifts and constrained computational resources. To address these challenges, we propose a novel model-free framework, GRADATE (GRAph DATa sElector), that selects the best training data from the source domain for the classification task on the target domain. GRADATE picks training samples without relying on any GNN model's predictions or training recipes, leveraging optimal transport theory to capture and adapt to distribution changes. GRADATE is data-efficient and scalable, and it complements existing model-centric GDA approaches. Through comprehensive empirical studies on several real-world graph-level datasets and multiple covariate shift types, we demonstrate that GRADATE outperforms existing selection methods and enhances off-the-shelf GDA methods with far less training data.

Paper Structure

This paper contains 76 sections, 4 theorems, 22 equations, 2 figures, 14 tables, 4 algorithms.

Key Result

Theorem 3.1

Given two graphs $\mathcal{G}_1=(\mathbf A_1,\mathbf X_1)$ and $\mathcal{G}_2=(\mathbf A_2,\mathbf X_2)$, for a $k$-layer graph neural network (GNN) $f$ with ReLU activations, under the regularity assumptions stated in the appendix, we have [...], where $d_{\textnormal{W}}$ denotes the $r$-Wasserstein distance, and $C$ and $\beta$ are constants depending on the GNN $f$, the regularity constants, and $k$.

Figures (2)

  • Figure 1: ECDF plots of graph density and size for the IMDB-BINARY, IMDB-MULTI, and MSRC_21 datasets. The blue, orange, and green curves represent the distributions of the training, validation, and test splits, respectively. Graphs are sorted in ascending order by the specified shift (density or size).
  • Figure 2: ECDF plots of graph density and size for the ogbg-molbbbp, ogbg-molbace, and ogbg-molhiv datasets. The blue, orange, and green curves represent the distributions of the training, validation, and test splits, respectively. Graphs are sorted in ascending order by the specified shift (density or size).
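
The splits described in the two captions can be sketched as follows: compute a shift statistic per graph, sort in ascending order, cut the ordering into contiguous train/validation/test blocks, and compute an ECDF for each block. This is a rough illustration only. The 80/10/10 fractions, the assignment of the lowest values to the training split, the helper names, and the toy `(num_nodes, num_edges)` graphs are all assumptions made here rather than details taken from the paper.

```python
def graph_density(num_nodes, num_edges):
    """Density of an undirected simple graph: edges present / edges possible."""
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


def ecdf(values):
    """Return the sorted values and the fraction of samples <= each of them."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def shift_split(graphs, key, train_frac=0.8, val_frac=0.1):
    """Sort graphs by `key` (ascending) and cut into train/val/test blocks.

    The fractions are hypothetical defaults, not values from the paper.
    """
    ordered = sorted(graphs, key=key)
    n_train = int(train_frac * len(ordered))
    n_val = int(val_frac * len(ordered))
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_val],
        ordered[n_train + n_val:],
    )


# Toy graphs given as (num_nodes, num_edges) pairs (hypothetical data).
graphs = [(4, 3), (5, 4), (6, 15), (3, 3), (8, 7),
          (5, 10), (10, 9), (6, 5), (7, 21), (4, 6)]


def density_of(g):
    return graph_density(g[0], g[1])


# Density shift: sparse graphs go to training, the densest ones to test.
train, val, test = shift_split(graphs, key=density_of)
train_x, train_y = ecdf([density_of(g) for g in train])

# Size shift: the same procedure with the number of nodes as the statistic.
size_train, size_val, size_test = shift_split(graphs, key=lambda g: g[0])
```

Plotting `train_x` against `train_y` as a step function, and doing the same for the other two splits, gives curves of the kind the captions describe.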

Theorems & Definitions (13)

  • Theorem 3.1
  • proof
  • Remark 3.2
  • Theorem 3.3: Graph Domain Generalization Gap
  • proof
  • Definition 3.4: Graph Dataset Distance Minimization
  • Proposition 3.5: Time Complexity Analysis (alvarez2020geometric; just2023lava; altschuler2017near)
  • proof
  • proof
  • Remark F.1
  • ...and 3 more