Hash Collisions in Molecular Fingerprints: Effects on Property Prediction and Bayesian Optimization

Walter Virany, Austin Tripp

TL;DR

Hash collisions in fixed-length molecular fingerprints can inflate molecular similarity when using compressed representations. The authors implement exact fingerprints and a collision-aware approach (including a Sort&Slice baseline) and evaluate them with Gaussian process surrogates using the Tanimoto kernel $k_T(x,x') = \frac{\sum_i \min(x_i, x'_i)}{\sum_i \max(x_i, x'_i)}$ and covariance $k(x,x') = a^2 k_T(x,x') + \sigma_n^2 \delta(x,x')$. On five DOCKSTRING property-prediction benchmarks, exact fingerprints yielded small but consistent improvements in regression metrics ($R^2$ increases between 0.006 and 0.017) over compressed fingerprints, whereas Sort&Slice was competitive. However, these improvements did not translate into significant gains in Bayesian optimization (BO) performance, with similar AUC trajectories across fingerprint types. The results suggest that mitigating hash collisions modestly benefits predictive accuracy in low-data molecular property tasks but leaves BO performance largely unaffected, which indicates to practitioners where collision-aware fingerprints provide value.
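
As a rough illustration of the kernel and covariance stated above, the following is a minimal sketch in plain Python, not the authors' implementation. The function names `tanimoto` and `gp_covariance` are hypothetical, the convention of returning 1.0 for two all-zero vectors is an assumption, and the $\delta(x,x')$ term is interpreted here as noise added only on the diagonal of the training covariance matrix.

```python
from typing import List, Sequence


def tanimoto(x: Sequence[float], y: Sequence[float]) -> float:
    """Min-max Tanimoto similarity between two non-negative count vectors."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Assumption: two all-zero vectors are treated as identical.
    return 1.0 if denominator == 0 else numerator / denominator


def gp_covariance(
    X: Sequence[Sequence[float]], amplitude: float, noise_std: float
) -> List[List[float]]:
    """Covariance matrix a^2 * k_T(x, x') + sigma_n^2 * delta(x, x').

    Assumption: delta is 1 only when both arguments are the same
    training point, so the noise term appears on the diagonal.
    """
    n = len(X)
    K = [
        [amplitude ** 2 * tanimoto(X[i], X[j]) for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        K[i][i] += noise_std ** 2
    return K


# Two toy count fingerprints (made-up values, not real molecules).
fp_a = [1, 2, 0]
fp_b = [1, 0, 3]
similarity = tanimoto(fp_a, fp_b)
K = gp_covariance([fp_a, fp_b], amplitude=2.0, noise_std=0.5)
```

Here the shared mass is 1 and the union mass is 6, so the similarity is 1/6. In practice a GP library would handle hyperparameter fitting; this sketch only shows how the two formulas fit together.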

Abstract

Molecular fingerprinting methods use hash functions to create fixed-length vector representations of molecules. However, hash collisions cause distinct substructures to be represented with the same feature, leading to overestimates in molecular similarity calculations. We investigate whether using exact fingerprints improves accuracy compared to standard compressed fingerprints in molecular property prediction and Bayesian optimization where the underlying predictive model is a Gaussian process. We find that using exact fingerprints yields a small yet consistent improvement in predictive accuracy on five molecular property prediction benchmarks from the DOCKSTRING dataset. However, these gains did not translate to significant improvements in Bayesian optimization performance.

Paper Structure

This paper contains 18 sections, 2 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Example of collisions. (Left) Two structurally different molecules (SMILES strings are CC(=O)OC1=CC=CC=C1C(=O)O and CC(C)CC1=CC=C(C=C1)C(C)C(=O)O) with highlighted circular substructures of radius 2. (Middle) Highlighted substructures and their corresponding Morgan identifiers, as well as the resulting hash values after taking modulo 32 (corresponding to the fingerprint size). (Right) Fingerprint vector where both distinct substructures map to the same element given by the hash key, demonstrating how different substructures can map to the same dimension.
  • Figure 2: $R^2$ scores for GP regression with optimized hyperparameters as a function of fingerprint dimension on the ESR2 (left) and KIT (right) targets. Exact fingerprints (purple) consistently outperform compressed fingerprints (orange) and the Sort&Slice method (green). Dark lines indicate the mean and shaded regions indicate $\pm1$ standard deviation across 10 random train/test splits. The dimension of exact fingerprints does not vary, so their performance is shown as a horizontal line for reference.
  • Figure 3: BO trajectories for two targets: ESR2 (top) and PGR (bottom). The first column shows the score of the best molecule at each iteration, and the second column shows the average score of the top 10 acquired molecules at each iteration. For fixed-length fingerprint types, two fingerprint sizes are shown: 1024 (dashed line) and 2048 (solid line). Dark lines indicate the median and shaded regions indicate the 1st and 3rd quartiles over 5 random trials. The dashed horizontal lines in the left-hand figures indicate the 99.9$^{\text{th}}$ percentile and the best possible score for each target.
  • Figure 4: Tanimoto similarity in compressed vs. exact fingerprints for 10,000 pairs of molecules. Each point represents one pair, where the x-axis is the Tanimoto similarity computed between exact fingerprints, and the y-axis is the Tanimoto similarity computed between compressed fingerprints. The diagonal line corresponds to equal similarity calculations for the two fingerprints, and points above this line indicate that compression results in overestimated similarity. Shade indicates the number of hash collisions between each pair, with darker colors representing more collisions.
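
The effect described in Figures 1 and 4 can be reproduced on toy data. The sketch below is an illustration under stated assumptions, not the paper's code: the substructure identifiers are made-up small integers (real Morgan identifiers are much larger), the helper names `fold`, `tanimoto_sparse`, and `tanimoto_dense` are hypothetical, and colliding counts are assumed to be summed when folding with modulo 32.

```python
from typing import Dict, List


def fold(exact: Dict[int, int], n_bits: int) -> List[int]:
    """Compress a sparse {identifier: count} fingerprint by modulo hashing.

    Assumption: counts of colliding identifiers are summed.
    """
    vec = [0] * n_bits
    for identifier, count in exact.items():
        vec[identifier % n_bits] += count
    return vec


def tanimoto_sparse(a: Dict[int, int], b: Dict[int, int]) -> float:
    """Min-max Tanimoto similarity on exact (uncompressed) fingerprints."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return 1.0 if den == 0 else num / den


def tanimoto_dense(x: List[int], y: List[int]) -> float:
    """Min-max Tanimoto similarity on fixed-length vectors."""
    num = sum(min(p, q) for p, q in zip(x, y))
    den = sum(max(p, q) for p, q in zip(x, y))
    return 1.0 if den == 0 else num / den


# Hypothetical identifiers: 100 and 132 are distinct substructures,
# but 100 % 32 == 132 % 32 == 4, so they collide after folding.
mol_a = {100: 1, 7: 1, 9: 1}
mol_b = {132: 1, 7: 1, 10: 1}

exact_similarity = tanimoto_sparse(mol_a, mol_b)
folded_similarity = tanimoto_dense(fold(mol_a, 32), fold(mol_b, 32))
```

With these toy inputs the exact similarity is 1/5, while the folded similarity is 2/4, because the collision at index 4 is counted as a shared feature. This corresponds to a point above the diagonal in Figure 4.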