GETS: Ensemble Temperature Scaling for Calibration in Graph Neural Networks

Dingyi Zhuang, Chonghe Jiang, Yunhan Zheng, Shenhao Wang, Jinhua Zhao

TL;DR

This work tackles the challenge of miscalibrated probabilities in graph neural networks by introducing GETS, a Graph Ensemble Temperature Scaling framework that blends input and model ensembles within a Graph MoE to achieve node-wise calibration. GETS jointly leverages multiple information sources—logits, node features, and degree information—through specialized calibration experts and a sparse gating mechanism that selects the most relevant experts per node. Empirically, GETS reduces the expected calibration error by about $25\%$ across 10 GNN benchmark datasets and demonstrates scalability, outperforming state-of-the-art methods such as CaGCN, GATS, and ETS, with some exceptions on certain datasets. The approach also provides insights into expert selection, input ablations, and potential fairness benefits across degree groups, highlighting practical impact for reliable uncertainty estimates in graph-based predictions.

Abstract

Graph Neural Networks deliver strong classification results but are often poorly calibrated, producing overconfident or underconfident predictions. This is particularly problematic in high-stakes applications where accurate uncertainty estimates are essential. Existing post hoc methods, such as temperature scaling, fail to effectively utilize graph structures, while current GNN calibration methods often overlook the potential of leveraging diverse input information and model ensembles jointly. In this paper, we propose Graph Ensemble Temperature Scaling (GETS), a novel calibration framework that combines input and model ensemble strategies within a Graph Mixture of Experts architecture. GETS outperforms SOTA calibration techniques, reducing expected calibration error by 25 percent across 10 GNN benchmark datasets. Additionally, GETS is computationally efficient, scalable, and capable of selecting effective input combinations for improved calibration performance. The implementation is available via GitHub.
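The node-wise calibration idea can be illustrated with a minimal sketch. It is not the authors' implementation: in GETS the experts and the gate are learned from logits, node features, and degree information, whereas here the per-expert temperatures (`expert_temps`) and gate scores (`gate_scores`) are hypothetical, hand-picked numbers for a single node. The sketch also assumes that the node's temperature is the gate-weighted average of the selected experts' temperatures.

```python
import math

def softmax(values):
    # Numerically stable softmax over a list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_gate(scores, k):
    # Softmax over the k largest gate scores; every other expert gets weight 0.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top_weights = softmax([scores[i] for i in top])
    weights = [0.0] * len(scores)
    for i, w in zip(top, top_weights):
        weights[i] = w
    return weights

def calibrate_node(logits, expert_temps, gate_scores, k=2):
    # Assumption: the node temperature is the gate-weighted mean of the
    # temperatures proposed by the selected experts.
    weights = sparse_gate(gate_scores, k)
    temperature = sum(w * t for w, t in zip(weights, expert_temps))
    probs = softmax([z / temperature for z in logits])
    return probs, temperature

# Toy node with three classes and four hypothetical experts.
logits = [2.0, 0.5, -1.0]
expert_temps = [0.8, 1.5, 2.5, 1.0]
gate_scores = [0.1, 1.2, 0.9, -0.3]
probs, temperature = calibrate_node(logits, expert_temps, gate_scores, k=2)
```

Dividing the logits by a positive temperature leaves the predicted class unchanged and only reshapes the confidence, which is why temperature scaling can be applied post hoc without affecting accuracy.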

Paper Structure

This paper contains 33 sections, 20 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Expected calibration error (ECE; see the ECE equation) for a well-trained CaGCN model (Wang et al., 2021). ECE is measured by grouping nodes based on degree ranges rather than predicted confidence. Datasets are sorted from low to high by average degree $\frac{2|\mathcal{E}|}{|\mathcal{V}|}$, used as a measure of connectivity (see the datasets table). In lower-connectivity datasets like Pubmed and Cora-full, high-degree nodes tend to show larger calibration errors, whereas in more connected datasets like Photo and Ogbn-arxiv, low-degree nodes exhibit higher calibration errors.
  • Figure 2: Illustration of input and model ensemble calibration. The input ensemble explores different combinations of input types, while the model ensemble employs a MoE framework to select the most effective experts for calibration. The final calibrated outputs are weighted averages determined by the gating mechanism. The notation $(\cdot)$ indicates different input types for the function.
  • Figure 3: Illustration of computational efficiency and expert selection properties. (a): Elapsed time for training each model for 10 runs; (b): Primary and secondary expert selections across datasets for various input ensembles. The top bar plot shows the frequency of expert selection, highlighting the significance of combining logits and features in calibration across datasets.
  • Figure 4: Reliability diagrams of CaGCN, GATS, GETS, and TS.
  • Figure 5: Reliability diagrams of VS and ETS.
  • ...and 2 more figures
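
The degree-grouped measurement described for Figure 1 can be sketched as follows. This is one plausible reading of the caption, not the paper's exact procedure: nodes are binned by degree instead of by confidence, each bin's gap is the absolute difference between accuracy and mean confidence, and the gaps are combined with weights proportional to bin size. The bin boundaries (`bounds`) and the toy data are hypothetical.

```python
from bisect import bisect_left

def average_degree(num_edges, num_nodes):
    # Average degree 2|E| / |V| of an undirected graph.
    return 2 * num_edges / num_nodes

def degree_group_gaps(confidences, correct, degrees, bounds):
    # bounds are inclusive upper limits of the degree bins, in increasing
    # order; degrees above the last bound fall into one extra bin.
    groups = {}
    for conf, hit, deg in zip(confidences, correct, degrees):
        groups.setdefault(bisect_left(bounds, deg), []).append((conf, hit))
    gaps = {}
    for g, items in groups.items():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(h for _, h in items) / len(items)
        gaps[g] = (abs(accuracy - mean_conf), len(items))
    return gaps

def degree_binned_ece(gaps):
    # Size-weighted sum of the per-bin |accuracy - confidence| gaps.
    total = sum(n for _, n in gaps.values())
    return sum(gap * n / total for gap, n in gaps.values())

# Toy example: four nodes, degree bins <=2, 3..5, and >5.
confidences = [0.9, 0.8, 0.7, 0.6]
correct = [1, 0, 1, 1]
degrees = [1, 2, 4, 9]
gaps = degree_group_gaps(confidences, correct, degrees, bounds=[2, 5])
ece = degree_binned_ece(gaps)
```

Reading the per-bin gaps separately, rather than only the aggregate, is what exposes whether low-degree or high-degree nodes are the worse calibrated in a given dataset.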