Hybrid Focal and Full-Range Attention Based Graph Transformers

Minhong Zhu, Zhenhao Zhao, Weiran Cai

TL;DR

This paper presents a purely attention-based architecture, the Focal and Full-Range Graph Transformer (FFGT), which mitigates the loss of local information when learning global correlations. FFGT improves the performance of existing Graph Transformers on various open datasets and achieves performance comparable to the state of the art on several Long-Range Graph Benchmark (LRGB) datasets, even with a vanilla transformer.

Abstract

Transformers built on the self-attention mechanism have shown clear advantages in learning graph-structured data. However, while Graph Transformers can model full-range dependencies, they are often deficient in extracting information from local neighborhoods. A common practice is to use Message Passing Neural Networks (MPNNs) as an auxiliary module to capture local information, yet these remain inadequate for comprehending substructures. In this paper, we present a purely attention-based architecture, the Focal and Full-Range Graph Transformer (FFGT), which mitigates the loss of local information when learning global correlations. The core component of FFGT is a new compound attention mechanism, which combines conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information. Beyond the scope of canonical Transformers, FFGT has the merit of being more substructure-aware. Our approach improves the performance of existing Graph Transformers on various open datasets and achieves performance comparable to the state of the art on several Long-Range Graph Benchmark (LRGB) datasets, even with a vanilla transformer. We further examine the factors that influence the optimal focal length of attention by introducing a novel synthetic dataset based on SBM-PATTERN.

Paper Structure (23 sections, 7 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: An FFGT layer of compound attention, consisting of two complementary scopes: the full-range attention block is a fully connected attention block for obtaining global correlations, while the focal attention block is an attention module restricted to local ego-nets for comprehending substructural information. Each attention block comprises a number of attention heads (M and N, respectively). The outputs are concatenated and normalized (Layer Normalization here) to integrate information from the different scopes. Edge features are used by the two attention blocks separately.
  • Figure 2: Focal Attention Block. (a) Focal attention prunes high correlations over long ranges (green lines), allocating more focus to the locality (blue lines); (b) Example of a local ego-net with $FL=2$. Nodes (black) outside the circle (i.e. at a distance greater than 2 hops) are excluded from the attention computation; (c) Detailed architecture inside the focal attention block. A focal mask $FM \in \mathbb{R}^{n\times n}$ is employed to exclude nodes beyond the ego-net range, with $FM_{ij} = 1$ only when node $j$ belongs to the K-hop ego-net centered on node $i$ and $FM_{ij} = 0$ otherwise.
  • Figure 3: Ablation study on the focal length ($FL$) on the ZINC and Peptide-Functional datasets. Vanilla-FFGT is used here as the backbone model. The horizontal axis shows the focal length ($FL$), and "Vanilla" refers to the fully connected version.
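
The focal mask and compound attention described in the captions of Figures 1 and 2 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes one attention head per block, omits edge features, uses a parameter-free layer normalization, treats the center node as part of its own ego-net, and takes the hop count K to equal the focal length. The function names (`k_hop_focal_mask`, `masked_attention`, `compound_attention`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def k_hop_focal_mask(adj: np.ndarray, focal_length: int) -> np.ndarray:
    """Return FM with FM[i, j] = 1 if node j is within `focal_length` hops of i.

    Assumption: the ego-net includes its own center node (0 hops).
    """
    n = adj.shape[0]
    a = (adj > 0).astype(int)
    reach = np.eye(n, dtype=bool)  # nodes reachable in 0 hops
    for _ in range(focal_length):
        # Extend the reachable set by one more hop.
        reach = reach | ((reach.astype(int) @ a) > 0)
    return reach.astype(np.float64)


def masked_attention(x, wq, wk, wv, mask=None):
    """Single-head scaled dot-product attention with an optional 0/1 mask."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Masked-out pairs get -inf so their softmax weight is exactly zero.
        scores = np.where(mask > 0, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def layer_norm(h, eps=1e-5):
    """Layer normalization without learnable scale or shift (a simplification)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)


def compound_attention(x, adj, focal_length, params_full, params_focal):
    """Concatenate full-range and focal attention outputs, then normalize."""
    fm = k_hop_focal_mask(adj, focal_length)
    h_full, _ = masked_attention(x, *params_full)            # all node pairs
    h_focal, _ = masked_attention(x, *params_focal, mask=fm)  # ego-nets only
    return layer_norm(np.concatenate([h_full, h_focal], axis=-1))


# Usage on a 5-node path graph 0-1-2-3-4 with focal length 2.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # 5 nodes, 4 input features
params_full = tuple(rng.normal(size=(4, 3)) for _ in range(3))
params_focal = tuple(rng.normal(size=(4, 3)) for _ in range(3))
out = compound_attention(x, adj, 2, params_full, params_focal)
```

On the path graph, node 0 attends only to nodes 0, 1 and 2 in the focal block, while the full-range block still covers all five nodes. Because every ego-net contains its center, each masked row keeps at least one finite score, so the softmax stays well defined.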