Hash Collisions in Molecular Fingerprints: Effects on Property Prediction and Bayesian Optimization
Walter Virany, Austin Tripp
TL;DR
Hash collisions in fixed-length molecular fingerprints map distinct substructures to the same feature, which can inflate the computed similarity between molecules. The authors implement exact fingerprints and collision-aware alternatives, including a Sort&Slice baseline, and evaluate them with Gaussian process surrogates using the Tanimoto kernel k_T(x,x') = (sum_i min(x_i, x'_i)) / (sum_i max(x_i, x'_i)) and covariance k(x,x') = a^2 k_T(x,x') + sigma_n^2 delta(x,x'). On five DOCKSTRING property-prediction benchmarks, exact fingerprints yielded small but consistent improvements in regression metrics over compressed fingerprints (R^2 increases between 0.006 and 0.017), and Sort&Slice was competitive. These improvements did not translate into significant gains in Bayesian optimization (BO) performance: AUC trajectories were similar across fingerprint types. The results suggest that mitigating hash collisions modestly benefits predictive accuracy in low-data molecular property tasks but leaves BO performance largely unaffected, which indicates to practitioners where collision-aware fingerprints provide value.
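As an illustration of the two formulas above, the following is a minimal sketch of the min-max Tanimoto kernel and the resulting GP covariance matrix in plain Python. The function names, the convention of returning 1.0 for two all-zero vectors, and the choice to add the noise term on the matrix diagonal (one noise contribution per training point) are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence


def tanimoto(x: Sequence[float], y: Sequence[float]) -> float:
    """Min-max Tanimoto similarity between two non-negative count vectors."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    # Assumption: two all-zero vectors are treated as identical.
    return 1.0 if den == 0 else num / den


def gp_covariance(
    X: Sequence[Sequence[float]], amplitude: float, noise_std: float
) -> List[List[float]]:
    """Covariance matrix K[i][j] = a^2 * k_T(x_i, x_j) + sigma_n^2 * [i == j]."""
    n = len(X)
    K = [
        [amplitude**2 * tanimoto(X[i], X[j]) for j in range(n)]
        for i in range(n)
    ]
    # Assumption: the delta term is applied per training point (diagonal only).
    for i in range(n):
        K[i][i] += noise_std**2
    return K


# Two toy count fingerprints of length 3.
fp_a = [1, 2, 0]
fp_b = [1, 1, 1]

similarity = tanimoto(fp_a, fp_b)  # (1 + 1 + 0) / (1 + 2 + 1) = 0.5
K = gp_covariance([fp_a, fp_b], amplitude=2.0, noise_std=0.5)
# K[0][0] = 4 * 1.0 + 0.25 = 4.25, K[0][1] = 4 * 0.5 = 2.0
```

Here `amplitude` plays the role of a and `noise_std` the role of sigma_n; in a real GP both would typically be fitted to data rather than fixed by hand.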
Abstract
Molecular fingerprinting methods use hash functions to create fixed-length vector representations of molecules. However, hash collisions cause distinct substructures to be represented by the same feature, leading to overestimates of molecular similarity. We investigate whether using exact fingerprints improves accuracy compared to standard compressed fingerprints in molecular property prediction and Bayesian optimization where the underlying predictive model is a Gaussian process. We find that using exact fingerprints yields a small yet consistent improvement in predictive accuracy on five molecular property prediction benchmarks from the DOCKSTRING dataset. However, these gains do not translate to significant improvements in Bayesian optimization performance.
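The overestimation effect described above can be reproduced with a toy example. The sketch below represents an exact fingerprint as a sparse mapping from a substructure identifier to its count, and a compressed fingerprint as the same counts folded into a fixed number of bins with a modulo operation. The integer identifiers, the bin count of 8, and the helper names `fold` and `tanimoto_sparse` are hypothetical and chosen only for illustration; real fingerprinting libraries use their own hashing and folding schemes.

```python
from collections import Counter
from typing import Dict


def tanimoto_sparse(a: Dict[int, int], b: Dict[int, int]) -> float:
    """Min-max Tanimoto similarity between two sparse count fingerprints."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Assumption: two empty fingerprints are treated as identical.
    return 1.0 if den == 0 else num / den


def fold(fp: Dict[int, int], n_bins: int) -> Dict[int, int]:
    """Compress a sparse fingerprint into n_bins by summing colliding counts."""
    folded: Counter = Counter()
    for feature_id, count in fp.items():
        folded[feature_id % n_bins] += count
    return dict(folded)


# Hypothetical substructure identifiers: 1 and 9 are distinct substructures,
# but both map to bin 1 when folded into 8 bins (9 % 8 == 1).
mol_a = {1: 1, 10: 2}
mol_b = {9: 1, 10: 1}

exact_sim = tanimoto_sparse(mol_a, mol_b)
# min: 0 + 0 + 1 = 1, max: 1 + 1 + 2 = 4, so exact_sim = 0.25

folded_sim = tanimoto_sparse(fold(mol_a, 8), fold(mol_b, 8))
# bins: a = {1: 1, 2: 2}, b = {1: 1, 2: 1}; min = 2, max = 3, so folded_sim = 2/3
```

Because summing counts within a bin can only increase the numerator (the min of sums is at least the sum of mins) and decrease the denominator (the max of sums is at most the sum of maxes), folding in this way never lowers the min-max Tanimoto similarity, which is consistent with the direction of the bias described in the abstract.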
