Delayed Bottlenecking: Alleviating Forgetting in Pre-trained Graph Neural Networks

Zhe Zhao, Pengkun Wang, Xu Wang, Haibin Wen, Xiaolong Xie, Zhengyang Zhou, Qingfu Zhang, Yang Wang

TL;DR

This work addresses the forgetting problem observed when pre-training graph neural networks (GNNs) for downstream tasks, arguing that traditional pre-training compresses information in ways that can be detrimental to transfer. It introduces Delayed Bottlenecking Pre-training (DBP), a principled framework that preserves mutual information $I(\mathcal{D}^{pre}; Z)$ during pre-training by suppressing compression and then applies compression during fine-tuning guided by labeled downstream data, under two information-control objectives. The authors formulate tractable variational upper bounds for these objectives and provide theoretical results showing improved parameter transfer between pre-training and fine-tuning. Empirically, DBP demonstrates strong gains over state-of-the-art pre-training methods on chemistry and biology benchmarks, with analyses revealing favorable information dynamics and robustness across several GNN architectures. Overall, DBP offers a principled, generalizable approach to bridging pre-training and fine-tuning in graph representation learning with potential impact across domains.

Abstract

Pre-training GNNs to extract transferable knowledge and apply it to downstream tasks has become the de facto standard in graph representation learning. Recent work has focused on designing self-supervised pre-training tasks that extract useful, universally transferable knowledge from large-scale unlabeled data. However, these approaches face an unavoidable question: traditional pre-training strategies, which aim to extract information useful for the pre-training tasks, may not extract all information useful for the downstream task. In this paper, we reexamine the pre-training process within the traditional pre-training and fine-tuning framework from the perspective of the Information Bottleneck (IB) and confirm that the forgetting phenomenon in the pre-training phase may harm downstream tasks. We therefore propose a novel Delayed Bottlenecking Pre-training (DBP) framework, which preserves as much mutual information as possible between latent representations and training data during the pre-training phase by suppressing compression, and delays compression to the fine-tuning phase so that it can be guided by labeled fine-tuning data and downstream tasks. To achieve this, we design two information control objectives that can be directly optimized and integrate them into the model design. Extensive experiments in both the chemistry and biology domains demonstrate the effectiveness of DBP.
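
For orientation, the Information Bottleneck view referred to in the abstract is usually written as a trade-off between compressing the input and keeping label-relevant information. The form below is the standard textbook IB Lagrangian, not the paper's own objective; the trade-off weight $\lambda$ is notation assumed here, chosen to avoid confusion with the paper's hyperparameters $\alpha$ and $\beta$.

```latex
% Standard Information Bottleneck Lagrangian (textbook form, not the DBP objective).
% X: input, Y: task label, Z: latent representation, \lambda > 0: assumed trade-off weight.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \lambda \, I(Z; Y)
```

Minimizing the first term, $I(X;Z)$, is the compression step. Read against this form, the abstract's proposal is to suppress that compression during pre-training and to apply it only in fine-tuning, where $Y$ is the labeled downstream target.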

Paper Structure (18 sections, 7 theorems, 33 equations, 6 figures, 4 tables)

Key Result

Lemma 1

According to Shwartz-Ziv and Tishby (2017), in the normal training process, the mutual information $I(X;Z)$ between input data $X$ and latent representation $Z$ first increases and then decreases in the early stage of training, while the mutual information $I(X;Y)$ between input data $X$ and o...

Figures (6)

  • Figure 1: Information-theoretic analysis of conventional and delayed bottlenecking pre-training in graph neural networks. Subfigure (a) presents the dynamics of information encoding in the latent space $Z$ during conventional pre-training, relative to the pre-training data $X$ and associated task $Y$, and the subsequent impact on the downstream task $Y'$. In this regime, the latent representation $Z$ undergoes a compression process optimized for $Y$, which inadvertently discards features that are non-salient for $Y$ but may be pertinent to $Y'$, thereby diminishing the mutual information $I(Z; Y')$ after compression. Subfigure (b) depicts the proposed Delayed Bottlenecking Pre-training, in which the compression of $Z$ during the pre-training phase is deliberately controlled. This control preserves a broader set of features in $Z$, allowing for higher mutual information $I(Z; Y')$ after fine-tuning, where compression is guided by labeled data specific to $Y'$.
  • Figure 2: Architecture of DBP framework. Subfigure (a) corresponds to the generative and contrastive learning based self-supervised pre-training model. The optimization objective of pre-training consists of $L_{con}$ and $L_{pi}$ which are respectively used to extract general knowledge and avoid excessive information compression. Subfigure (b) indicates the information control based fine-tuning model. The optimization objective of fine-tuning, which is composed of $L_{cls}$ and $L_{fi}$, encourages enhanced information compression to improve classification performance. The two-phase transition is implemented by means of parameter transfer.
  • Figure 3: Dynamics of the mutual information $I(Y; Z)$ between the target labels $Y$ and the learned representations $Z$ across training epochs for different variants on two molecular property prediction datasets (BBBP and SIDER).
  • Figure 4: ROC-AUC curves across training epochs for different variants of the proposed DBP method on SIDER and ClinTox.
  • Figure 5: Hyperparameter sensitivity analysis and ablation study with respect to DBP. Subfigure (a) shows our ablation experiments on the information control modules during the pre-training and fine-tuning stages. Subfigure (b) illustrates our analysis experiments on the relationship between the information control hyperparameters $\alpha$ and $\beta$ and model performance across three datasets during the pre-training and fine-tuning stages.
  • ...and 1 more figure
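
The captions above track mutual information such as $I(Y; Z)$ over training. As a minimal illustration of the quantity itself, the sketch below computes mutual information for a small discrete joint distribution. It is a toy under stated assumptions: the function name `mutual_information` and the example tables are hypothetical, and the paper's curves concern learned continuous representations, which require an estimator rather than this exact sum.

```python
import math


def mutual_information(joint):
    """Return I(Y;Z) in nats for a discrete joint table joint[y][z].

    Toy illustration only: assumes the entries are probabilities summing to 1.
    """
    p_y = [sum(row) for row in joint]        # marginal over Y (row sums)
    p_z = [sum(col) for col in zip(*joint)]  # marginal over Z (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # terms with p = 0 contribute nothing
                mi += p * math.log(p / (p_y[i] * p_z[j]))
    return mi


# Hypothetical examples: Z carries no information about Y vs. Z determines Y.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

In the independent table the value is zero, and in the fully dependent table it equals the entropy of a fair binary label, $\log 2$ nats. A representation that has discarded label-relevant features sits closer to the first case.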

Theorems & Definitions (8)

  • Lemma 1: Representation Forgetting
  • Theorem 1: Pre-training Information Transfer
  • Proposition 1: Upper bound of $\mathcal{L}_{pi}$
  • Proposition 2: Upper bound of $\mathcal{L}_{fine}$
  • Definition 1: KL Divergence
  • Lemma 2: Chain Rule of KL Divergence
  • Lemma 3: Non-Negativity of KL Divergence
  • Theorem 2: Bounding Posterior Distributions via DBP
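
The KL-divergence items listed above (Definition 1, Lemma 2, Lemma 3) have standard textbook forms: $D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$, the chain rule $D_{KL}(p(x,y) \| q(x,y)) = D_{KL}(p(x) \| q(x)) + \mathbb{E}_{p(x)}\left[D_{KL}(p(y|x) \| q(y|x))\right]$, and $D_{KL}(p \| q) \ge 0$. The paper's exact statements are not reproduced on this page, so the sketch below checks only these standard forms numerically on a small discrete example. All names and the example distributions are assumptions made for illustration.

```python
import math


def kl(p, q):
    """Discrete KL divergence D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def flatten(joint):
    """Flatten a joint table joint[x][y] into a single probability vector."""
    return [v for row in joint for v in row]


def marginal_x(joint):
    """Marginal distribution over x (row sums)."""
    return [sum(row) for row in joint]


def conditional_y_given_x(joint):
    """Conditional distributions p(y | x), one row per value of x."""
    return [[v / sum(row) for v in row] for row in joint]


# Hypothetical joint distributions over (x, y).
p_xy = [[0.30, 0.10], [0.20, 0.40]]
q_xy = [[0.25, 0.25], [0.25, 0.25]]

kl_joint = kl(flatten(p_xy), flatten(q_xy))

p_x, q_x = marginal_x(p_xy), marginal_x(q_xy)
p_y_x, q_y_x = conditional_y_given_x(p_xy), conditional_y_given_x(q_xy)

kl_marginal = kl(p_x, q_x)
# Expected conditional KL, weighted by p(x).
kl_conditional = sum(px * kl(py, qy) for px, py, qy in zip(p_x, p_y_x, q_y_x))

chain_rule_gap = abs(kl_joint - (kl_marginal + kl_conditional))
```

Here `chain_rule_gap` is zero up to floating-point error, and `kl_joint` is non-negative, matching the chain rule and non-negativity properties respectively.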