GATE: How to Keep Out Intrusive Neighbors

Nimrah Mustafa; Rebekka Burkholz

TL;DR

GATE addresses a key limitation of Graph Attention Networks (GATs): their inability to switch off task-irrelevant neighborhood aggregation, which hurts deep GNNs and learning on heterophilic graphs. The authors extend GAT to GATE by giving node-feature and neighborhood contributions separate budgets, an approach grounded in a gradient-flow conservation framework that allows aggregation to be switched off in well-trained regimes. They provide theoretical insights, a synthetic test bed, and extensive experiments showing that GATE outperforms GAT and many baselines, achieves state-of-the-art results on ogb-arxiv, and performs strongly on heterophilic real-world data. The result is a flexible, depth-friendly approach to graph learning with interpretable attention patterns.

Abstract

Graph Attention Networks (GATs) are designed to provide flexible neighborhood aggregation that assigns weights to neighbors according to their importance. In practice, however, GATs are often unable to switch off task-irrelevant neighborhood aggregation, as we show experimentally and analytically. To address this challenge, we propose GATE, a GAT extension that holds three major advantages: i) It alleviates over-smoothing by addressing its root cause of unnecessary neighborhood aggregation. ii) Similarly to perceptrons, it benefits from higher depth as it can still utilize additional layers for (non-)linear feature transformations in case of (nearly) switched-off neighborhood aggregation. iii) By down-weighting connections to unrelated neighbors, it often outperforms GATs on real-world heterophilic datasets. To further validate our claims, we construct a synthetic test bed to analyze a model's ability to utilize the appropriate amount of neighborhood aggregation, which could be of independent interest.
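To make "switching off neighborhood aggregation" concrete, the following minimal sketch computes softmax attention coefficients for a single node whose neighborhood includes a self-loop. It is a generic illustration, not the GAT or GATE parameterization from the paper: the attention logits are passed in as plain numbers rather than computed from node features, and the function names `attention_coefficients` and `self_attention_weight` are hypothetical. Because the coefficients sum to one, the self-weight $\alpha_{vv}$ approaches 1 only when the self-loop logit is much larger than every neighbor logit.

```python
import math


def attention_coefficients(logits):
    """Softmax over the attention logits of one node's neighborhood."""
    largest = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - largest) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def self_attention_weight(self_logit, neighbor_logits):
    """Return alpha_vv, the share of attention a node places on itself.

    Assumption: the self-loop logit is listed first, followed by the
    logits of the other neighbors.
    """
    return attention_coefficients([self_logit] + list(neighbor_logits))[0]


if __name__ == "__main__":
    neighbor_logits = [0.0, 0.0, 0.0, 0.0]  # four equally scored neighbors
    for self_logit in (0.0, 2.0, 10.0):
        alpha_vv = self_attention_weight(self_logit, neighbor_logits)
        print(f"self logit {self_logit:4.1f} -> alpha_vv = {alpha_vv:.4f}")
```

The sketch only shows what the attention weights must look like for aggregation to be switched off. It does not model the training dynamics that, according to the paper, keep GAT from reaching that state in practice.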

Paper Structure (29 sections, 3 theorems, 26 equations, 14 figures, 11 tables)

Introduction
Related Work
Architecture
Notation
GAT
GATE
Theoretical Insights
Experiments
Synthetic Test Bed
Learning self-sufficient node labels
Learning neighbor-dependent node labels
Real-World Data
Conclusion
Theoretical Derivations
Derivation of Insight \ref{insight:GAT}
...and 14 more sections

Key Result

Theorem 4.1

The parameters $\theta$ of a layer $l\in[L-1]$ in a GAT network and their gradients $\nabla_{\theta} \mathcal{L}$ w.r.t. loss $\mathcal{L}$ fulfill:

Figures (14)

Figure 1: MLP only performs node feature transformations, whereas GAT also always aggregates over the neighborhood. With the ability to switch off neighborhood aggregation, GATE can learn to emulate MLP behavior and potentially interleave effective perceptron and standard layers in a flexible manner. This allows for more expressive power that we find to benefit real-world tasks (see Table \ref{real-het-results}).
Figure 2: Examples of synthetic input graphs constructed for learning tasks that are (a) self-sufficient and can be better solved by switching off neighborhood aggregation, i.e., $\alpha_{vv}=1$, and (b) neighbor-dependent and benefit from ignoring the node's own features, i.e., $\alpha_{vv}=0$. In both cases, $\forall v\in \mathbb{V}$, $\sum_{u\in\mathbb{N}(v),u\neq v} \alpha_{uv} + \alpha_{vv}=1$. These represent opposite ends of the spectrum, whereas real-world tasks often lie in between and require $\alpha_{vv}\in [0,1]$. GATE's attention mechanism is more flexible than GAT's in learning the level of neighborhood aggregation required for a task.
Figure 3: Distribution of $\alpha_{vv}$ against training epoch for the self-sufficient learning problem, using the Cora structure and the one-hot encoding of labels as input node features, for $1$- and $2$-layer models. Due to space limitations, we defer the plots of $5$-layer networks to Fig. \ref{alphaDist-cora-oneHotFeats-5-layer} in Appendix \ref{appendix:additional-results}.
Figure 4: (a)-(c): Distribution of node labels of a synthetic dataset, with neighbor-dependent node labels, based on nodes' own random features (left) and neighbors' features aggregated $k$ times (right).
Figure 5: Distribution of $\alpha_{vv}$ against training epoch for the neighbor-dependent learning problem with $k=3$. Rows: GAT (top) and GATE (bottom) architecture. Columns (left to right): $3$-, $4$-, and $5$-layer models. While GAT is unable to switch off neighborhood aggregation in any layer, only $3$ layers of the $4$- and $5$-layer GATE models perform neighborhood aggregation.
...and 9 more figures

Theorems & Definitions (3)

Theorem 4.1: Thm. 2.2 by balanceGATs
Theorem 4.3: Structure of GATE gradients
Theorem 1.1: Structure of GATE$_S$ gradients
