Graph Data Selection for Domain Adaptation: A Model-Free Approach
Ting-Wei Li, Ruizhong Qiu, Hanghang Tong
TL;DR
GRADATE is a model-free data selection framework for graph domain adaptation (GDA). It introduces the Graph Dataset Distance (GDD), which builds on the Fused Gromov-Wasserstein distance to compare graphs and can optionally incorporate labels. The GREAT algorithm optimizes a sparse weighting over training graphs to minimize the GDD between the weighted source set and the validation set, yielding a high-quality data subset that improves both data-efficient training and downstream GDA methods. Theoretical results bound the target risk by the source risk plus a GDD term, which justifies the data-selection objective. Experiments on six real-world graph datasets show that GRADATE often outperforms model-centric baselines and consistently boosts off-the-shelf GNNs and GDA methods while using far less data. Overall, GRADATE is a universal, scalable, and data-efficient approach that complements existing GDA strategies by prioritizing training data quality under distribution shift.
Abstract
Graph domain adaptation (GDA) is a fundamental task in graph machine learning, and techniques such as shift-robust graph neural networks (GNNs) and specialized training procedures have been developed to tackle the distribution shift problem. Although these model-centric approaches show promising results, they often struggle under severe shifts and constrained computational resources. To address these challenges, we propose a novel model-free framework, GRADATE (GRAph DATa sElector), that selects the best training data from the source domain for the classification task on the target domain. GRADATE picks training samples without relying on any GNN model's predictions or training recipes, leveraging optimal transport theory to capture and adapt to distribution changes. GRADATE is data-efficient and scalable, and it complements existing model-centric GDA approaches. Through comprehensive empirical studies on several real-world graph-level datasets and multiple covariate shift types, we demonstrate that GRADATE outperforms existing selection methods and enhances off-the-shelf GDA methods with far less training data.
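
To make the idea of transport-based data selection concrete, the following is a minimal sketch under strong simplifying assumptions; it is not the GREAT algorithm or the GDD computation. It assumes a matrix of pairwise graph distances between training and validation graphs has already been computed by some graph distance (the numbers in toy_dist are hypothetical), and that the validation graphs carry uniform mass. With no sparsity budget on the training weights, the transport cost is minimized by sending each validation graph's mass to its nearest training graph, so the weights have a closed form. The function name select_weights and all variable names are illustrative.

```python
def select_weights(dist, val_mass=None):
    """Weights over training graphs that minimize the transport cost to
    the validation set when training weights are free on the simplex.

    dist[i][j] is an assumed, precomputed distance between training
    graph i and validation graph j.
    """
    n_train = len(dist)
    n_val = len(dist[0])
    if val_mass is None:
        # Assumption: every validation graph has equal mass.
        val_mass = [1.0 / n_val] * n_val

    weights = [0.0] * n_train
    cost = 0.0
    for j in range(n_val):
        # Each validation graph draws its mass from its nearest training graph.
        best_i = min(range(n_train), key=lambda i: dist[i][j])
        weights[best_i] += val_mass[j]
        cost += val_mass[j] * dist[best_i][j]
    return weights, cost


# Hypothetical distances: 4 training graphs (rows) x 3 validation graphs (columns).
toy_dist = [
    [0.9, 0.8, 0.7],
    [0.1, 0.6, 0.9],
    [0.5, 0.2, 0.3],
    [0.8, 0.9, 0.6],
]

weights, cost = select_weights(toy_dist)
# Training graphs with nonzero weight form the selected subset.
selected = [i for i, w in enumerate(weights) if w > 0]
print(selected, weights, cost)
```

In this toy setting at most as many training graphs as there are validation graphs receive nonzero weight, which illustrates why a transport objective tends to produce a small selected subset. An explicit selection budget or label information would require a proper optimization procedure in place of this closed form.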
