Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach

Zhihua Duan, Jialin Wang

TL;DR

This work addresses the challenge of combining the local structural priors of GNNs with the global modeling of Transformers by proposing a cross-architectural distillation framework. A multiscale distillation strategy transfers GNN structural knowledge to a Transformer student through micro-structure distillation, macro-structure distillation, and multiscale feature alignment, with the terms balanced by an adaptive objective. The method is evaluated on the Citeseer dataset with several GNN teachers; the Transformer student inherits the graph priors and reaches competitive performance (e.g., $74.5$ with a GCN teacher and $72.56$ with a GAT teacher). The framework offers a new paradigm for graph-aware Transformers, with potential impact on graph representation learning and on downstream tasks that require both local topology and global context.

Abstract

Integrating the structural inductive biases of Graph Neural Networks (GNNs) with the global contextual modeling capabilities of Transformers is a pivotal challenge in graph representation learning. GNNs excel at capturing localized topological patterns through message passing, but their limited ability to model long-range dependencies and to parallelize hinders deployment in large-scale scenarios. Conversely, Transformers use self-attention to achieve global receptive fields but struggle to inherit the intrinsic graph structural priors of GNNs. This paper proposes a novel knowledge distillation framework that systematically transfers multiscale structural knowledge from GNN teacher models to Transformer student models, offering a new perspective on the critical challenges of cross-architectural distillation. The framework bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multiscale feature alignment. This work establishes a new paradigm for inheriting graph structural biases in Transformer architectures, with broad application prospects.
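
To make the idea of a micro-macro distillation loss concrete, the following is a minimal sketch in plain Python. It is not the authors' formulation, which this summary does not give. It assumes that the micro term is a temperature-softened KL divergence between teacher and student node predictions, that the macro term is a squared distance between mean-pooled node embeddings, and that a fixed weight `alpha` stands in for the adaptive balancing. All function names, the temperature, and the requirement that teacher and student embeddings share a dimension are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one node's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def micro_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed node-level term: mean KL(teacher || student) over nodes."""
    per_node = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_node) / len(per_node)

def mean_pool(node_embeddings):
    """Average node embeddings into a single graph-level vector."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(row[d] for row in node_embeddings) / n for d in range(dim)]

def macro_loss(teacher_emb, student_emb):
    """Assumed graph-level term: mean squared error between pooled embeddings.

    Assumes both models emit embeddings of the same dimension; otherwise a
    learned projection would be needed first.
    """
    t, s = mean_pool(teacher_emb), mean_pool(student_emb)
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)

def distillation_loss(teacher_logits, student_logits,
                      teacher_emb, student_emb, alpha=0.5):
    """Weighted sum of both terms; alpha is a fixed, hypothetical weight."""
    return (alpha * micro_loss(teacher_logits, student_logits)
            + (1.0 - alpha) * macro_loss(teacher_emb, student_emb))

if __name__ == "__main__":
    # Toy graph: 2 nodes, 2 classes, 2-dimensional embeddings.
    t_logits = [[2.0, 0.0], [0.0, 1.0]]
    s_logits = [[0.5, 0.5], [1.0, 0.0]]
    t_emb = [[1.0, 0.0], [0.0, 1.0]]
    s_emb = [[0.5, 0.5], [0.0, 0.0]]
    print(distillation_loss(t_logits, s_logits, t_emb, s_emb))
```

When the student reproduces the teacher exactly, both terms vanish and the loss is zero; any mismatch in predictions or pooled embeddings makes it positive. In practice the loss would be computed on tensors with an automatic-differentiation library and added to the student's supervised objective.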

Paper Structure

This paper contains 22 sections, 10 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Knowledge Distillation from GNNs to Transformers.