Pre-training Graph Neural Networks on Molecules by Using Subgraph-Conditioned Graph Information Bottleneck

Van Thuy Hoang, O-Joun Lee

TL;DR

S-CGIB, a novel Subgraph-conditioned Graph Information Bottleneck, is proposed for pre-training GNNs to recognize core subgraphs (graph cores) and significant subgraphs. It generates a set of functional group candidates, i.e., ego networks, and uses an attention-based interaction between the graph core and the candidates.

Abstract

This study aims to build a pre-trained Graph Neural Network (GNN) model on molecules without human annotations or prior knowledge. Although various approaches have been proposed to overcome the difficulty of acquiring labeled molecules, previous pre-training methods still rely on semantic subgraphs, i.e., functional groups. Focusing only on functional groups can overlook graph-level distinctions. The key challenge in building a pre-trained GNN on molecules is how to (1) generate well-distinguished graph-level representations and (2) automatically discover functional groups without prior knowledge. To address this challenge, we propose a novel Subgraph-conditioned Graph Information Bottleneck, named S-CGIB, for pre-training GNNs to recognize core subgraphs (graph cores) and significant subgraphs. The main idea is that, under the S-CGIB principle, the graph cores contain compressed and sufficient information that can generate well-distinguished graph-level representations and reconstruct the input graph conditioned on significant subgraphs across molecules. To discover significant subgraphs without prior knowledge about functional groups, we propose generating a set of functional group candidates, i.e., ego networks, and using an attention-based interaction between the graph core and the candidates. Although they are identified through self-supervised learning, our learned subgraphs match real-world functional groups. Extensive experiments on molecule datasets across various domains demonstrate the superiority of S-CGIB.
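The two ingredients named in the abstract, ego-network candidates and an attention-based interaction with the graph core, can be illustrated with a minimal sketch. This is not the authors' implementation. The hop radius `k`, the toy chain graph, the hand-written embeddings, and the use of scaled dot-product attention are all assumptions made for illustration, since this excerpt does not give the exact formulation.

```python
import math
from collections import deque


def ego_network(adj, root, k):
    """Return the set of nodes within k hops of root (a k-hop ego network)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond the k-hop radius
        for neighbor in adj[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)


def attention_weights(core, candidates):
    """Scaled dot-product attention of one core vector over candidate vectors.

    Assumption: the paper's attention may be parameterized differently.
    """
    scale = math.sqrt(len(core))
    scores = [sum(c * x for c, x in zip(core, cand)) / scale for cand in candidates]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy graph: a 4-atom chain 0-1-2-3 stored as an adjacency list.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# One functional group candidate (ego network) per node, with an assumed radius of 1.
candidates = {v: ego_network(adj, v, k=1) for v in adj}

# Hypothetical embeddings; in practice a GNN encoder would produce these.
core_embedding = [1.0, 0.0]
candidate_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Higher weight means the candidate aligns more strongly with the graph core.
weights = attention_weights(core_embedding, candidate_embeddings)
```

In this sketch the candidates with the largest weights would play the role of the significant subgraphs. How S-CGIB actually scores and selects them is defined in the paper body rather than in this summary.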

Paper Structure

This paper contains 42 sections, 25 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: The overall architecture of S-CGIB.
  • Figure 2: An efficiency analysis for variants of S-CGIB. The solid lines are training curves, and the dashed lines are validation curves (PT: Pre-training, D.A.: Domain Adaptation).
  • Figure 3: Visualizations of model interpretability in functional group detection tasks.
  • Figure 4: Performance according to weighting factor $\zeta$ for the term $I\left(G ; S \right )$ in Eq. 14.
  • Figure 5: Performance according to subgraph sizes ($k$).
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1: GIB
  • Definition 2: CGIB
  • Definition 3: S-CGIB