Enhance GNNs with Reliable Confidence Estimation via Adversarial Calibration Learning

Yilong Wang, Jiahao Zhang, Tianxiang Zhao, Suhang Wang

TL;DR

This work tackles the challenge of poorly calibrated GNN predictions in graph-structured data, where global calibration methods fail to ensure reliable confidence across node subgroups. It introduces AdvCali, an adversarial calibration framework that jointly learns node-wise temperature scaling and an adversarial group detector to identify miscalibrated subgroups, guided by a differentiable Group-ECE loss. The method demonstrates strong improvements in both global and subgroup calibration across eight real-world benchmarks and remains effective across different backbones, with ablations confirming the necessity of both the cross-entropy and Group-ECE components. By automatically discovering dataset-specific miscalibration patterns, AdvCali offers robust and scalable confidence estimation for GNNs in practical, high-stakes applications.
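The node-wise temperature scaling mentioned above can be illustrated with a minimal NumPy sketch. It assumes the temperatures are already given; in AdvCali they are learned, and that learning step is not shown here. The function names and the numeric values are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def scale_with_node_temperatures(logits, temperatures):
    """Divide each node's logits by its own positive temperature, then apply softmax.

    logits:       (N, C) array of per-node class logits from a trained GNN.
    temperatures: (N,) array of positive per-node temperatures (assumed given).
    """
    temperatures = np.asarray(temperatures, dtype=float).reshape(-1, 1)
    return softmax(np.asarray(logits, dtype=float) / temperatures)

# Hypothetical logits for two nodes and three classes.
logits = np.array([[4.0, 1.0, 0.0],
                   [2.0, 1.5, 1.0]])
temps = np.array([2.0, 0.5])  # illustrative values, not learned

base = softmax(logits)                               # uncalibrated probabilities
probs = scale_with_node_temperatures(logits, temps)  # per-node rescaled probabilities
```

With these illustrative values, the first node's top-class probability drops (temperature above 1) and the second node's rises (temperature below 1), while the predicted class of each node is unchanged, since dividing by a positive scalar preserves the ordering of the logits.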

Abstract

Despite their impressive predictive performance, GNNs often exhibit poor confidence calibration, i.e., their predicted confidence scores do not accurately reflect the true likelihood of correctness. This issue raises concerns about their reliability in high-stakes domains such as fraud detection and risk assessment, where well-calibrated predictions are essential for decision-making. To ensure trustworthy predictions, several GNN calibration methods have been proposed. Though they can improve global calibration, our experiments reveal that they often fail to generalize across different node groups, leading to inaccurate confidence in node groups with different degree levels, classes, and local structures. In certain cases, they even degrade calibration compared to the original uncalibrated GNN. To address this challenge, we propose AdvCali, a novel framework that adaptively enhances calibration across different node groups. Our method leverages adversarial training to automatically identify miscalibrated node groups and applies a differentiable Group Expected Calibration Error (Group-ECE) loss term to refine confidence estimation within these groups. This allows the model to dynamically adjust its calibration strategy without relying on dataset-specific prior knowledge about miscalibrated subgroups. Extensive experiments on real-world datasets demonstrate that our approach not only improves global calibration but also significantly enhances calibration within groups defined by feature similarity, topology, and connectivity, outperforming previous methods and demonstrating its effectiveness in practical scenarios.

Paper Structure

This paper contains 21 sections, 1 theorem, 13 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

When the group weight matrix $\mathbf{G}$ has dimensions $\mathbb{R}^{N\times M}$ and takes values $\mathbf{G}_{i, j}= \mathbbm{1}(v_i \in \mathcal{B}_j)$, and the distance metric is the absolute error, i.e., $\mathrm{dist}(x_1, x_2) = |x_1 - x_2|$, the Expected Calibration Error defined in Definiti

Figures (6)

  • Figure 1: Visualization of calibration performance on the Pubmed (left) and Cora (right) datasets under different evaluation metrics. The y-axis represents the metric scores; lower scores indicate better calibration performance.
  • Figure 2: Reliability diagrams of various calibration methods on the Pubmed dataset for the whole graph (left) and the top 25% high-degree nodes (right). The y-axis denotes accuracy, and the x-axis denotes confidence. The diagonal line indicates perfect calibration, where confidence aligns exactly with accuracy. A curve above the diagonal means that accuracy exceeds confidence, indicating an under-confident model. Conversely, a curve below the diagonal implies accuracy is lower than confidence, meaning the model is overconfident.
  • Figure 3: Illustration of the model structure.
  • Figure 4: Ablation study results.
  • Figure 5: Hyperparameter analysis of the Group-ECE contribution factor $\lambda$ and the number of groups $K$.
  • ...and 1 more figure
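The reading of a reliability diagram described in the Figure 2 caption can be sketched in code: group predictions into confidence bins, then compare each bin's accuracy with its mean confidence. This is a minimal illustration with equal-width bins and made-up inputs; the function name, the bin count, and the data are assumptions, not taken from the paper.

```python
import numpy as np

def reliability_bins(confidence, correct, n_bins=10):
    """Return (bin index, count, mean confidence, accuracy, label) per non-empty bin.

    confidence: (N,) predicted confidence of the chosen class, in [0, 1].
    correct:    (N,) 1 if the prediction was right, else 0.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1,
    # so a confidence of exactly 1.0 falls in the last bin.
    idx = np.digitize(confidence, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # empty bins contribute no point to the diagram
        acc = correct[mask].mean()
        conf = confidence[mask].mean()
        if acc > conf:
            label = "under-confident"  # point lies above the diagonal
        elif acc < conf:
            label = "overconfident"    # point lies below the diagonal
        else:
            label = "calibrated"
        rows.append((b, int(mask.sum()), conf, acc, label))
    return rows

# Made-up predictions: three high-confidence nodes, two mid-confidence nodes.
rows = reliability_bins([0.95, 0.92, 0.91, 0.55, 0.58], [1, 0, 0, 1, 1])
for row in rows:
    print(row)
```

To inspect a subgroup such as the top 25% high-degree nodes, the same function could be applied to the confidence and correctness arrays restricted to those nodes.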

Theorems & Definitions (2)

  • Definition 1: Expected calibration error (ECE)
  • Proposition 1: Relationship between Group-ECE and ECE
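As a rough numerical illustration of the setting in Proposition 1, the sketch below computes the standard binned ECE and a group-weighted calibration error driven by a hard indicator matrix $\mathbf{G}_{i,j} = \mathbbm{1}(v_i \in \mathcal{B}_j)$ with absolute-error distance. The proposition text on this page is truncated, so the weighted form used in `group_weighted_error` is an assumption made for illustration and is not the paper's definition of Group-ECE. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def bin_index(confidence, n_bins=10):
    """Equal-width confidence bin index in 0..n_bins-1 for each node."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    return np.digitize(confidence, edges[1:-1])

def ece(confidence, correct, n_bins=10):
    """Standard binned ECE: sum over bins of |B_j|/N * |acc(B_j) - conf(B_j)|."""
    idx = bin_index(confidence, n_bins)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return float(total)

def bin_indicator_matrix(confidence, n_bins=10):
    """Hard group matrix G of shape (N, M) with G[i, j] = 1 iff node i is in bin j."""
    idx = bin_index(confidence, n_bins)
    G = np.zeros((len(confidence), n_bins))
    G[np.arange(len(confidence)), idx] = 1.0
    return G

def group_weighted_error(G, confidence, correct):
    """ASSUMED weighted form (not from the paper): for each group j, compare the
    G-weighted mean accuracy with the G-weighted mean confidence using absolute
    error, and weight the gap by the group's share of the total mass."""
    mass = G.sum(axis=0)
    keep = mass > 0  # skip empty groups to avoid division by zero
    group_conf = (G.T @ confidence)[keep] / mass[keep]
    group_acc = (G.T @ correct)[keep] / mass[keep]
    weights = mass[keep] / G.shape[0]
    return float(np.sum(weights * np.abs(group_acc - group_conf)))

# Synthetic data: confidences and correctness drawn at random for 200 nodes.
rng = np.random.default_rng(0)
confidence = rng.uniform(0.3, 1.0, size=200)
correct = (rng.uniform(size=200) < confidence).astype(float)

G = bin_indicator_matrix(confidence)
standard = ece(confidence, correct)
weighted = group_weighted_error(G, confidence, correct)
```

Under this assumed form, the two quantities agree up to floating-point error when $\mathbf{G}$ is the hard bin indicator, because each column of $\mathbf{G}$ then simply selects one confidence bin. A soft, learned $\mathbf{G}$ would instead weight nodes fractionally across groups.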