A Python library for efficient computation of molecular fingerprints

Michał Szafarczyk, Piotr Ludynia, Przemysław Kukla

TL;DR

The paper introduces scikit-fingerprints, a Python library designed for efficient, parallel computation of molecular fingerprints with a scikit-learn-like API, addressing the need for scalable preprocessing in chemoinformatics ML workflows. It implements multiple fingerprint families (ECFP, Atom Pair, Topological Torsion, MACCS Keys, ErG, MAP4, MHFP, E3FP) and reimplements MAP4/MHFP to reduce dependencies, enabling fast, batch processing on large datasets. Benchmarks on the HIV MoleculeNet dataset demonstrate substantial speedups via multiprocessing and competitive ML performance with simple models, highlighting that traditional fingerprints remain effective even against modern GNNs. The work also details robust DevOps (Poetry, CI/CD, MIT license) and code-quality practices to support open-source collaboration and reproducibility, with clear plans for expanding fingerprint coverage and 3D features in future work. Overall, the library enables scalable fingerprint-based ML in chemistry, offering a practical, extensible foundation for researchers and practitioners.

Abstract

Machine learning solutions are very popular in the field of chemoinformatics, where they have numerous applications, such as novel drug discovery or molecular property prediction. Molecular fingerprints are algorithms commonly used for vectorizing chemical molecules as a part of preprocessing in this kind of solution. However, despite their popularity, there are no libraries that implement them efficiently for large datasets, utilizing modern, multicore architectures. On top of that, most of them do not provide the user with an intuitive interface, or one that would be compatible with other machine learning tools. In this project, we created a Python library that computes molecular fingerprints efficiently and delivers a comprehensive interface that lets the user easily incorporate the library into their existing machine learning workflow. The library enables the user to perform computation on large datasets using parallelism, which makes tasks such as hyperparameter tuning feasible in a reasonable time. We describe the tools used in the implementation of the library and assess its time performance on example benchmark datasets. Additionally, we show that using molecular fingerprints we can achieve results comparable to state-of-the-art ML solutions even with very simple models.
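To make the idea of a scikit-learn-like, parallel fingerprint transformer concrete, here is a minimal sketch using only the Python standard library. It is not the scikit-fingerprints API or any of its algorithms: the class `ToyHashedFingerprint`, its parameters `fp_size`, `max_n`, and `n_jobs`, and the helper functions are hypothetical names chosen for illustration, and the "fingerprint" simply hashes character n-grams of a SMILES string rather than real chemical substructures. The sketch only shows the general pattern of a `fit`/`transform` interface that splits a batch of molecules into chunks and processes them in separate worker processes.

```python
from concurrent.futures import ProcessPoolExecutor
from zlib import crc32


def _hash_ngrams(smiles, fp_size, max_n):
    """Fold all character n-grams (n = 1..max_n) into a binary vector."""
    bits = [0] * fp_size
    for n in range(1, max_n + 1):
        for i in range(len(smiles) - n + 1):
            # crc32 is deterministic across processes, unlike built-in hash()
            bits[crc32(smiles[i:i + n].encode()) % fp_size] = 1
    return bits


def _hash_chunk(task):
    """Worker entry point: fingerprint one chunk of SMILES strings."""
    chunk, fp_size, max_n = task
    return [_hash_ngrams(s, fp_size, max_n) for s in chunk]


class ToyHashedFingerprint:
    """Hypothetical, stateless transformer with a scikit-learn-style interface."""

    def __init__(self, fp_size=64, max_n=3, n_jobs=1):
        self.fp_size = fp_size
        self.max_n = max_n
        self.n_jobs = n_jobs  # assumed to be a positive integer in this sketch

    def fit(self, X, y=None):
        # Nothing to learn; present only for interface compatibility.
        return self

    def transform(self, X):
        X = list(X)
        if self.n_jobs == 1 or len(X) < 2:
            return _hash_chunk((X, self.fp_size, self.max_n))
        # Split the batch into roughly equal chunks, one per worker.
        n_chunks = min(self.n_jobs, len(X))
        size = -(-len(X) // n_chunks)  # ceiling division
        tasks = [
            (X[i:i + size], self.fp_size, self.max_n)
            for i in range(0, len(X), size)
        ]
        with ProcessPoolExecutor(max_workers=self.n_jobs) as pool:
            parts = pool.map(_hash_chunk, tasks)  # results keep input order
            return [row for part in parts for row in part]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


if __name__ == "__main__":
    smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CN"]
    fingerprints = ToyHashedFingerprint(fp_size=32, n_jobs=2).fit_transform(smiles)
    print(len(fingerprints), len(fingerprints[0]))
```

Chunking the input, rather than submitting one task per molecule, keeps inter-process communication overhead small relative to the work done in each worker, which is the usual reason batch-oriented designs scale well on large datasets.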

Paper Structure

This paper contains 45 sections, 1 equation, 26 figures, 10 tables.

Figures (26)

  • Figure 1: Representation of a molecule in SMILES format [smiles-language].
  • Figure 2: The workflow for virtual screening [accelerated-design-admixtures].
  • Figure 3: The scheme of the molecular conformation and corresponding energy [conformers].
  • Figure 4: Superimposable and non-superimposable molecules [chirality].
  • Figure 5: The construction of Atom Pair fingerprint [atom-pair].
  • ...and 21 more figures