Molecular Topological Profile (MOLTOP) -- Simple and Strong Baseline for Molecular Graph Classification

Jakub Adamczyk, Wojciech Czech

TL;DR

MOLTOP addresses the need for strong, easy-to-use baselines in molecular graph classification by combining topological edge descriptors with simple atom and bond features and a hyperparameter-free Random Forest, which yields fast, robust performance. It aggregates degree-based topology and edge-level descriptors (EBC, ARI, SCAN) into histograms, encodes atom and bond statistics as features, and sets the number of histogram bins $n_{bins}$ from the median molecule size in the dataset, so no tuning is required. On MoleculeNet with scaffold splits and on out-of-domain peptide data, MOLTOP surpasses many GNNs, including those without pretraining, while remaining computationally efficient and stable; only a few pretrained models (e.g., GEM) outperform it, and they come with substantial practical downsides. The work emphasizes the ongoing value of descriptor-based baselines for fair evaluation and provides extensive analyses of feature importance, expressivity, and practical deployment, suggesting that strong, simple baselines remain essential for understanding advances in graph representation learning for chemistry.
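
The histogram aggregation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `median_bins` and `normalized_histogram` are hypothetical, the number of bins is assumed to equal the median atom count, and descriptor values are assumed to lie in a known range `[lo, hi]`.

```python
from statistics import median


def median_bins(molecule_sizes):
    """Number of histogram bins, assumed here to be the median molecule size."""
    return max(1, int(median(molecule_sizes)))


def normalized_histogram(values, n_bins, lo=0.0, hi=1.0):
    """Fixed-length normalized histogram of per-edge descriptor values in [lo, hi]."""
    counts = [0] * n_bins
    for v in values:
        # Map the value to a bin index; clamp so that v == hi lands in the last bin.
        idx = int((v - lo) * n_bins / (hi - lo))
        counts[min(max(idx, 0), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Hypothetical data: atom counts of three molecules, edge scores of one molecule.
sizes = [4, 6, 9]
n_bins = median_bins(sizes)
features = normalized_histogram([0.1, 0.5, 0.5, 1.0], n_bins)
```

Because every molecule is mapped to a vector of the same length regardless of its number of edges, the result can be fed directly to a tabular classifier such as a Random Forest.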

Abstract

We revisit the effectiveness of topological descriptors for molecular graph classification and design a simple yet strong baseline. We demonstrate that a simple approach to feature engineering (histogram aggregation of edge descriptors and one-hot encoding of atomic numbers and bond types), when combined with a Random Forest classifier, can establish a strong baseline for Graph Neural Networks (GNNs). The novel algorithm, Molecular Topological Profile (MOLTOP), integrates Edge Betweenness Centrality, Adjusted Rand Index, and SCAN Structural Similarity score. This approach proves remarkably competitive with modern GNNs, while also being simple, fast, low-variance, and hyperparameter-free. Our approach is rigorously tested on MoleculeNet datasets using the fair evaluation protocol provided by Open Graph Benchmark. We additionally show out-of-domain generalization capabilities on the peptide classification task from Long Range Graph Benchmark. The evaluations across eleven benchmark datasets reveal MOLTOP's strong discriminative capabilities, surpassing the $1$-WL test and even the $3$-WL test for some classes of graphs. Our conclusion is that descriptor-based baselines, such as the one we propose, are still crucial for accurately assessing advancements in the GNN domain.
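
To make one of the edge descriptors concrete, the following sketch computes the SCAN structural similarity for every edge of a small graph. It assumes the usual SCAN definition over closed neighborhoods, $\sigma(u, v) = |\Gamma(u) \cap \Gamma(v)| / \sqrt{|\Gamma(u)| \, |\Gamma(v)|}$, where $\Gamma(u)$ contains $u$ and its neighbors. The function name `scan_scores` and the toy graph are hypothetical, and MOLTOP's exact variant may differ in details.

```python
from math import sqrt


def scan_scores(adjacency):
    """SCAN structural similarity for every edge of an undirected graph.

    `adjacency` maps each node to the set of its neighbors.
    """
    # Closed neighborhoods: each node together with its neighbors.
    closed = {u: nbrs | {u} for u, nbrs in adjacency.items()}
    scores = {}
    for u, nbrs in adjacency.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                shared = len(closed[u] & closed[v])
                scores[(u, v)] = shared / sqrt(len(closed[u]) * len(closed[v]))
    return scores


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
scores = scan_scores(graph)
```

Edges inside the triangle score higher than the pendant edge because their endpoints share more of their neighborhoods; a histogram of such scores summarizes how densely connected a molecule's bonds are.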

Paper Structure (24 sections, 2 theorems, 5 equations, 9 figures, 10 tables)

Key Result

Theorem 1

The computational complexity of computing the Adjusted Rand Index (ARI) for all existing edges in the graph is $O(k |E|)$, with worst-case complexity $O(|V||E|)$, which occurs for complete (full) graphs.
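
The bound can be illustrated with a short sketch. It assumes the 2x2 contingency-table form of ARI used for link prediction, $\mathrm{ARI}(u, v) = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)}$, where $a$ counts nodes adjacent to both endpoints, $b$ and $c$ count nodes adjacent to only $u$ or only $v$, and $d$ counts the rest. This excerpt does not define $k$; the sketch reads it as an upper bound on node degree, which is an assumption. The names `edge_ari` and `all_edge_ari` and the neighborhood convention (open neighborhoods, with all $|V|$ nodes counted) are likewise assumptions.

```python
def edge_ari(adjacency, u, v):
    """ARI of the neighborhoods of u and v (2x2 contingency-table form)."""
    n = len(adjacency)
    nu, nv = adjacency[u], adjacency[v]
    a = len(nu & nv)   # nodes adjacent to both u and v
    b = len(nu) - a    # nodes adjacent to u only
    c = len(nv) - a    # nodes adjacent to v only
    d = n - a - b - c  # nodes adjacent to neither
    denom = (a + b) * (b + d) + (a + c) * (c + d)
    return 2 * (a * d - b * c) / denom if denom else 0.0


def all_edge_ari(adjacency):
    """ARI for each undirected edge; one set intersection per edge."""
    return {
        (u, v): edge_ari(adjacency, u, v)
        for u, nbrs in adjacency.items()
        for v in nbrs
        if u < v
    }


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
ari = all_edge_ari(graph)
```

The only non-constant work per edge is the set intersection, whose cost is bounded by the smaller of the two degrees. Summing over all edges gives $O(k |E|)$ under the degree-bound reading of $k$, and in a complete graph every degree is $|V| - 1$, which matches the stated $O(|V||E|)$ worst case.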

Figures (9)

  • Figure 1: Average MOLTOP feature importances.
  • Figure 2: DPCC and Paclitaxel molecules.
  • Figure 3: Normalized histograms of edge descriptors for DPCC and Paclitaxel: (a) EBC (b) ARI (c) SCAN.
  • Figure 4: Molecule sizes distribution: (a) HIV dataset, (b) ToxCast dataset.
  • Figure 5: Feature extraction scheme in MOLTOP.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof