ZSMILES: an approach for efficient SMILES storage for random access in Virtual Screening

Gianmarco Accordi, Davide Gadioli, Giorgio Seguini, Andrea R. Beccari, Gianluca Palermo

TL;DR

ZSMILES tackles the storage challenges of extreme-scale virtual screening with a SMILES compression scheme built on a fixed, shared dictionary, which keeps the output readable and enables random access. It exploits domain knowledge through a pre-processing step and dictionary pre-population, and encodes each SMILES with a Dijkstra-based shortest-path search over a dictionary trie; a CUDA-accelerated implementation is provided for speed. The method achieves competitive compression, with a compression ratio as low as $0.29$ in optimized configurations, and its GPU implementation delivers speedups of $7\times$ for compression and $2\times$ for decompression while remaining memory-bound. The work reduces cold-storage footprints and enables efficient data retrieval for HPC-enabled drug discovery, and it applies broadly to large SMILES libraries and virtual screening pipelines.

Abstract

Virtual screening is a technique used in drug discovery to select the most promising molecules to test in a lab. To perform virtual screening, we need a large set of molecules as input, and storing these molecules can become an issue. In fact, extreme-scale high-throughput virtual screening applications require a large dataset of input molecules and produce an even larger dataset as output. These molecule databases occupy tens of TB of storage space, and domain experts frequently sample a small portion of this data. In this context, SMILES is a popular data format for storing large sets of molecules since it requires significantly less space to represent molecules than other formats (e.g., MOL2, SDF). This paper proposes an efficient dictionary-based approach to compress SMILES-based datasets. This approach takes advantage of domain knowledge to provide a readable output with separable SMILES, enabling random access. We examine the benefits of storing these datasets using ZSMILES to reduce the cold storage footprint in HPC systems. The main contributions concern a custom dictionary-based approach and a data pre-processing step. Experimental results show that ZSMILES leverages domain knowledge to compress $1.13\times$ more than the state of the art in similar scenarios, reaching a compression ratio as low as $0.29$. We tested a CUDA version of ZSMILES targeting NVIDIA GPUs, showing a potential speedup of $7\times$.
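The shortest-path encoding mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes a toy dictionary (`TOY_DICT`, hypothetical and not taken from ZSMILES), single-character codes that are assumed never to occur in the input SMILES, and a plain hash map in place of the dictionary trie. String positions are graph nodes, each dictionary match or literal character is an edge of cost one output symbol, and Dijkstra's algorithm finds an encoding with the fewest symbols. The function names `encode` and `decode` are illustrative only.

```python
import heapq

# Hypothetical toy dictionary: SMILES substring -> one-character code.
# The codes are assumed not to appear in any input SMILES string.
TOY_DICT = {"c1ccccc1": "{", "C(=O)": "}", "CC": "|", "(=O)O": "_"}


def encode(smiles, table=TOY_DICT):
    """Return an encoding of `smiles` that uses the fewest output symbols."""
    n = len(smiles)
    max_len = max(map(len, table), default=0)
    dist = [float("inf")] * (n + 1)  # dist[i]: symbols needed to reach position i
    prev = [None] * (n + 1)          # prev[j]: (previous position, emitted symbol)
    dist[0] = 0
    heap = [(0, 0)]                  # (cost so far, position)
    while heap:
        cost, i = heapq.heappop(heap)
        if cost > dist[i]:
            continue                 # stale heap entry
        if i == n:
            break                    # end of string reached with minimal cost
        # Literal edge: copy one character unchanged.
        edges = [(i + 1, smiles[i])]
        # Dictionary edges: every pattern that matches at position i.
        for length in range(2, min(max_len, n - i) + 1):
            code = table.get(smiles[i:i + length])
            if code is not None:
                edges.append((i + length, code))
        for j, symbol in edges:
            if cost + 1 < dist[j]:
                dist[j] = cost + 1
                prev[j] = (i, symbol)
                heapq.heappush(heap, (cost + 1, j))
    # Walk back from the end to recover the emitted symbols.
    out = []
    i = n
    while i > 0:
        i, symbol = prev[i]
        out.append(symbol)
    return "".join(reversed(out))


def decode(encoded, table=TOY_DICT):
    """Expand every code back to its pattern; other characters pass through."""
    inverse = {code: pattern for pattern, code in table.items()}
    return "".join(inverse.get(ch, ch) for ch in encoded)
```

With this toy table, the 21-character aspirin SMILES `CC(=O)Oc1ccccc1C(=O)O` encodes to 5 symbols and decodes back to the original string. Because each molecule is encoded independently against a fixed dictionary, any single line can be decoded without reading the rest of the file, which is the property that makes random access possible. The resulting toy ratio is not a result from the paper.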

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Graphical representation of vanillin (left) and its SMILES representation (right).
  • Figure 2: Graphical representation of the SMILES pre-processing step.
  • Figure 3: Graphical representation of the compression and decompression process in ZSMILES.
  • Figure 4: Compression ratios of different tools on a mixed dataset, comparing both short-string and file-based methods.
  • Figure 5: ZSMILES normalized execution times of the C++ and CUDA implementation with different $L_{max}$ values.