A Metric for the Balance of Information in Graph Learning

Alex O. Davies, Nirav S. Ajmeri, Telmo de Menezes e Silva Filho

TL;DR

This paper addresses the problem of determining whether graph learning on molecules primarily uses structural information or features. It introduces Noise-Noise Ratio Difference (NNRD), a metric computed by applying independent noise to structure and features and measuring the resulting degradation in performance, summarized as a single score, $\mathrm{NNRD} = \log\left(\frac{1}{|T|}\sum_{t} \frac{h_X(t)}{h_E(t)}\right)$. The authors validate NNRD on Open Graph Benchmark molecular datasets using a 3-layer GIN and noise across ten levels, showing that NNRD aligns with intuitive information balance and can reveal biases that simple performance aggregates miss. They discuss limitations such as model-dependence and outlier datasets, and suggest reporting NNRD for fixed models to guide dataset design and learning strategy. Overall, NNRD provides an interpretable, domain-agnostic tool for quantifying the balance of information sources in graph learning.
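
As a rough illustration of the score in the TL;DR, the sketch below evaluates the log of the mean ratio $h_X(t)/h_E(t)$ over a set of noise levels. The summary does not define $h_X$ and $h_E$ precisely, so this sketch assumes they are per-level performance values measured under feature noise and structure noise respectively; it also assumes a natural logarithm. The function name `nnrd` and its arguments are hypothetical and not taken from the paper.

```python
import math


def nnrd(h_x, h_e):
    """Log of the mean ratio h_X(t) / h_E(t) over noise levels t.

    Assumption: h_x[i] and h_e[i] are positive scores measured at the
    same noise level t_i, under feature noise and structure noise.
    """
    if len(h_x) != len(h_e) or len(h_x) == 0:
        raise ValueError("h_x and h_e must be non-empty and equally long")
    if any(e <= 0 for e in h_e) or any(x <= 0 for x in h_x):
        raise ValueError("scores must be positive for the log-ratio to exist")

    # One ratio per noise level, then the mean over all levels (1/|T| * sum_t).
    ratios = [x / e for x, e in zip(h_x, h_e)]
    return math.log(sum(ratios) / len(ratios))
```

Under these assumptions, identical curves for the two noise types give a score of exactly zero, and the score moves away from zero as the two curves diverge.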

Abstract

Graph learning on molecules makes use of information from both the molecular structure and the features attached to that structure. Much work has been conducted on biasing either towards structure or features, with the aim that bias bolsters performance. Identifying which information source a dataset favours, and therefore how to approach learning that dataset, is an open issue. Here we propose Noise-Noise Ratio Difference (NNRD), a quantitative metric for whether there is more useful information in structure or features. By employing iterative noising on features and structure independently, leaving the other intact, NNRD measures the degradation of information in each. We employ NNRD over a range of molecular tasks, and show that it corresponds well to a loss of information, with intuitive results that are more expressive than simple performance aggregates. Our future work will focus on expanding data domains, tasks and types, as well as refining our choice of baseline model.

Paper Structure

This paper contains 7 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: A molecule undergoing structure noise through edge removal and addition. Each noise step is applied on the original molecule, meaning that these examples are not sequential.
  • Figure 2: Performance variation for supervised training of our GIN models on each molecular benchmark dataset with increasing noise on structure and features. All datasets except LIPO and ESOL are classification tasks, for which we report ROC-AUC.
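
The structure noise described in the Figure 1 caption could be sketched roughly as follows. The caption only states that edges are removed and added and that every noise step starts from the original molecule; the specific scheme here (drop each edge with probability `t`, then add the same number of previously absent edges) is an assumption made for illustration, as are the names `noise_edges` and `noise_sweep`.

```python
import random
from itertools import combinations


def noise_edges(num_nodes, edges, t, seed=None):
    """Return a noised copy of an undirected edge list at noise level t in [0, 1].

    Assumed scheme (not from the paper): each original edge is dropped with
    probability t, and one random non-edge is added per dropped edge.
    """
    rng = random.Random(seed)
    original = {tuple(sorted(e)) for e in edges}

    # Removal: keep an edge only if its random draw is at least t.
    kept = {e for e in sorted(original) if rng.random() >= t}
    n_removed = len(original) - len(kept)

    # Addition: sample node pairs that were not edges in the original graph.
    non_edges = [p for p in combinations(range(num_nodes), 2) if p not in original]
    added = rng.sample(non_edges, min(n_removed, len(non_edges)))

    return sorted(kept | set(added))


def noise_sweep(num_nodes, edges, levels, seed=0):
    """Apply each noise level to the ORIGINAL graph, so levels are not sequential."""
    return {t: noise_edges(num_nodes, edges, t, seed=seed) for t in levels}


# Usage: a six-node ring standing in for a molecule, swept over ten levels.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
levels = [i / 10 for i in range(1, 11)]
noised = noise_sweep(6, ring, levels)
```

Because `noise_sweep` always passes the original edge list, the output at one level never depends on the output at another, which matches the non-sequential behaviour the caption describes.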