Distribution Learning for Molecular Regression

Nima Shoghi; Pooya Shoghi; Anuroop Sriram; Abhishek Das

TL;DR

This work studies regression with soft targets. It analyzes histogram-based targets and the biases they introduce, then proposes Distributional Mixture of Experts (DMoE), which learns target distributions with a loss that combines histogram cross-entropy and a distance term. The approach improves over baselines on OC20, MD17, and QM9 across multiple GNN backbones, and comes with formal gradient bounds and uncertainty metrics. Key contributions include addressing distribution quantization error and histogram distance bias, proposing non-uniform bin distributions with multi-head histograms, and offering uncertainty measures (entropy and KL divergence) with calibration techniques. While effective on in-distribution (ID) data and several out-of-distribution (OOD) scenarios, the method shows limited gains on certain OC20 OOD splits, which points to domain sensitivity and a need for OOD-specific design choices.
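The two uncertainty measures named above, entropy and KL divergence, can both be computed directly from a predicted histogram. The sketch below is illustrative only: the function names are hypothetical, and this summary does not state which pair of distributions the paper compares with KL divergence, so the example simply compares two arbitrary histograms.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete histogram; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two histograms over the same bins (eps guards empty bins in q)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical predicted histograms over four bins.
confident = [0.7, 0.1, 0.1, 0.1]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(entropy(confident), entropy(uncertain))
print(kl_divergence(confident, uncertain))
```

A flat histogram has the largest entropy for a given number of bins, so entropy can serve as a simple per-prediction uncertainty score under these assumptions.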

Abstract

Using "soft" targets to improve model performance has been shown to be effective in classification settings, but the usage of soft targets for regression is a much less studied topic in machine learning. The existing literature on the usage of soft targets for regression fails to properly assess the method's limitations, and empirical evaluation is quite limited. In this work, we assess the strengths and drawbacks of existing methods when applied to molecular property regression tasks. Our assessment outlines key biases present in existing methods and proposes methods to address them, evaluated through careful ablation studies. We leverage these insights to propose Distributional Mixture of Experts (DMoE): a model-independent and data-independent method for regression that trains a model to predict probability distributions of its targets. Our proposed loss function combines the cross entropy between predicted and target distributions with the L1 distance between their expected values, producing a loss that is robust to the outlined biases. We evaluate the performance of DMoE on different molecular property prediction datasets -- Open Catalyst (OC20), MD17, and QM9 -- across different backbone model architectures -- SchNet, GemNet, and Graphormer. Our results demonstrate that the proposed method is a promising alternative to classical regression for molecular property prediction tasks, showing improvements over baselines on all datasets and architectures.
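The loss described in the abstract, cross entropy between predicted and target distributions plus the L1 distance between their expected values, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Gaussian-shaped soft target, the weighting factor `alpha`, the width `sigma`, and all function names are assumptions introduced here.

```python
import math

def soft_target(y, centers, sigma):
    """Assumed soft target: a normalized Gaussian bump around y over the bin centers."""
    weights = [math.exp(-0.5 * ((c - y) / sigma) ** 2) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Numerically stable softmax turning model logits into a histogram."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(probs, centers):
    """Expected value of a histogram defined over the bin centers."""
    return sum(p * c for p, c in zip(probs, centers))

def combined_loss(logits, y, centers, sigma=0.5, alpha=1.0):
    """Cross entropy plus alpha times the L1 distance between expected values."""
    pred = softmax(logits)
    target = soft_target(y, centers, sigma)
    ce = -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))
    l1 = abs(expected_value(pred, centers) - expected_value(target, centers))
    return ce + alpha * l1

# Usage: a prediction peaked at the true value scores lower than one peaked far away.
centers = [-2.0, -1.0, 0.0, 1.0, 2.0]
good = combined_loss([0.0, 0.0, 5.0, 0.0, 0.0], y=0.0, centers=centers)
bad = combined_loss([0.0, 0.0, 0.0, 0.0, 5.0], y=0.0, centers=centers)
print(good, bad)
```

The distance term is what separates two predictions that place their mass in the wrong bins but at different distances from the target, a case that cross entropy alone cannot always distinguish.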

Paper Structure (48 sections, 4 theorems, 31 equations, 27 figures, 24 tables)

Introduction
Histogram Regression
Distribution Quantization Error
Histogram Bin Distribution
Histogram Distance Bias
Analysis
Stable Gradients
Uncertainty Quantification
Evaluation
Open Catalyst 2020: Relaxed Energy Prediction
MD17: Molecular Dynamics
QM9: Molecular Property Prediction
Uncertainty Quantification
Ablations
Bin Distribution
...and 33 more sections

Key Result

Theorem 1

Assume that $g(x)$ is locally $l$-Lipschitz continuous w.r.t. the model's parameters $\theta$. Then the norm of the gradient of the DMoE loss w.r.t. $\theta$ is bounded by:

Figures (27)

Figure 1: Overview of histogram regression. DMoE additions are highlighted in green.
Figure 2: For a normally distributed regression target, traditional uniform histograms, shown in (a), yield a much lower precision (and thus a higher error) than the histogram with normally distributed bins, shown in (b). Using the quantization-error equation, we compute the quantization error of (a) to be $6.07$ and (b) to be $2.53$.
Figure 3: The model's output is represented by the green histograms, and the red and blue histograms represent two sample model predictions. The histogram loss value (HL) and the distance-based loss metric (DL) are shown in the legend box. Note that in (a), the red histogram is much closer to the ground truth than the blue histogram, but their HL values are equal. The distance-based metric fixes this. (b) demonstrates how this phenomenon is present even if we induce a Gaussian distribution.
Figure 4: ${\left\|f^2(x)\right\|}_2$ values for example Gaussian, Categorical, and Uniform distributions.
Figure 5: For each figure, the two output histograms are displayed in red and blue. The ground-truth value is displayed by the green line. (a) shows a multi-histogram scenario with uniform bin distributions where the two output histograms share the bin endpoint. (b) shows a multi-histogram scenario where the two output histograms' bin endpoint distributions are adjusted using the multi-histogram algorithm in the appendix.
...and 22 more figures
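The comparison in Figure 2, uniform bins versus normally distributed bins for a normally distributed target, can be illustrated with a small experiment. The paper's quantization-error equation is not reproduced on this page, so the sketch below uses a stand-in proxy (mean absolute distance from each sample to its nearest bin center). The proxy, the bin count, the range, and all function names are assumptions, and the resulting numbers are not comparable to the $6.07$ and $2.53$ reported in the caption.

```python
from statistics import NormalDist

def uniform_centers(lo, hi, n):
    """Centers of n equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / n
    return [lo + (i + 0.5) * width for i in range(n)]

def normal_centers(mu, sigma, n):
    """Centers placed at evenly spaced quantiles of N(mu, sigma), so bins are denser near mu."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]

def mean_quantization_error(samples, centers):
    """Proxy error: average distance from each sample to its nearest bin center."""
    return sum(min(abs(s - c) for c in centers) for s in samples) / len(samples)

# Deterministic stand-in for samples of a standard normal target: a fine quantile grid.
target = NormalDist(0.0, 1.0)
samples = [target.inv_cdf((i + 0.5) / 2000) for i in range(2000)]

err_uniform = mean_quantization_error(samples, uniform_centers(-4.0, 4.0, 16))
err_normal = mean_quantization_error(samples, normal_centers(0.0, 1.0, 16))
print(err_uniform, err_normal)
```

Under this proxy, the quantile-based bins spend most of their resolution where the target mass is concentrated, which is the intuition behind choosing a non-uniform bin distribution.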

Theorems & Definitions (7)

Theorem 1
Lemma 1
proof
Lemma 2
proof
Theorem 2
proof
