FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching

Joongwon Lee, Seonghwan Kim, Seokhyun Moon, Hyunwoo Kim, Woo Youn Kim

TL;DR

FragFM tackles the scalability bottleneck of atom-centric molecular graph generation by introducing a fragment-level discrete flow matching framework. A coarse-to-fine autoencoder preserves atom-level connectivity while operating on a fragment-level graph, and a stochastic fragment bag enables efficient exploration of a vast fragment space. The approach supports flexible conditioning via fragment bag reweighting and classifier guidance, enabling precise property-driven design, and introduces NPGen to benchmark natural product-like molecules. Empirical results show state-of-the-art or competitive performance on standard benchmarks, strong NP-focused metrics, and substantially faster sampling, underscoring FragFM's potential for large-scale, property-aware chemical space exploration.

Abstract

We introduce FragFM, a novel hierarchical framework that uses fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Combined with a stochastic fragment bag strategy that handles an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and offers additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate the ability of modern molecular graph generative models to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We compare FragFM against various models on diverse molecular generation benchmarks, including NPGen, and demonstrate superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.

Paper Structure

This paper contains 66 sections, 42 equations, 29 figures, 15 tables.

Figures (29)

  • Figure 1: Overview of FragFM. (a) FragFM utilizes a hierarchical framework consisting of a coarse-to-fine autoencoder (\ref{subsec:molecular_graph_compression_by_coarse_to_fine_autoencoder}) and fragment-level graph flow matching (\ref{subsec: training}). An input atom-level graph ($\mathbf{G}$) is first decomposed via the fragmentation rule. The result is processed by a coarse-to-fine encoder, which compresses it into a joint representation $X=(\mathcal{G},z)$ comprising a fragment-level graph $\mathcal{G}$ and a latent vector $z$ designed to capture fine-grained atomistic connectivity information not explicitly present in $\mathcal{G}$. During generation (\ref{subsec:gen_process}), the neural network $f_\theta$ selects the most probable fragment from a fragment bag $\mathcal{B}$, which is a stochastically sampled subset of the full fragment bag $\mathcal{F}$. FragFM then employs two flow-matching processes: (i) a discrete flow generates the target fragment-level graph $\mathcal{G}_1$ from an initial $\mathcal{G}_0$ (mask and uniform prior for node and edge, respectively), operating with fragments from $\mathcal{B}$; (ii) a continuous flow generates the target latent vector $z_1$ from an initial $z_0$ drawn from a Gaussian prior $\mathcal{N}(0,1)$. (b) Finally, given $(\mathcal{G}_1, z_1)$, the coarse-to-fine decoder reconstructs the atom-level molecular graph by first predicting the probabilities of all possible atom-to-atom edges, and then applying the Blossom algorithm to select the edge set that maximizes the likelihood of the true graph. Further details and hyperparameters are described in \ref{appsec:parameterization_and_hyperparameters} and \ref{fig:method_parameterization}.
  • Figure 2: NPGen dataset overview. (a) UMAP visualization comparing MOSES, GuacaMol, and NPGen datasets. (b) Representative molecules from NPGen with annotations from NPClassifier (pathway, superclass, and class).
  • Figure 3: Randomly selected molecules from DiGress (top) and FragFM (bottom) trained on NPGen. Molecules are sampled at random from moderate-sized ones containing $31$ to $40$ heavy atoms. Chemically implausible moieties are highlighted in red. More examples are provided in \ref{appsubsec:visualization_and_analysis_of_generated_molecules_on_npgen}.
  • Figure 4: Property conditioning results for QED. MAE-FCD curves under different target QED values. Each curve shows results as the classifier guidance strength is varied.
  • Figure 5: Effect of $\lambda_\mathcal{B}$ in conditioning. (left) $\lambda_\mathcal{B}\!=\!0.4$, $\lambda_X\!=\!2.0$; the DiGress guidance level is set as $2{,}000$ for comparable FCD values. (right) MAE–FCD curves on ZINC250k with JAK2 docking score conditioning at $-11.0$ kcal/mol. Red markers indicate $\lambda_{X}\!=\!0.0$, i.e., fragment-bag-only guidance. Each curve shows results as the classifier guidance strength is varied.
  • ...and 24 more figures
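
The Figure 1 caption describes selecting the most probable fragment from a stochastically sampled bag $\mathcal{B} \subset \mathcal{F}$. The following is a minimal sketch of that one step, not the authors' implementation. It assumes uniform sampling without replacement, a softmax restricted to the bag, and a toy scoring function standing in for $f_\theta$; the names `sample_fragment_bag`, `select_fragment`, `toy_score`, and the fragment strings in `FULL_BAG` are all hypothetical.

```python
import math
import random


def sample_fragment_bag(full_bag, bag_size, rng):
    """Draw a random subset B of the full fragment collection F.

    Assumption: uniform sampling without replacement.
    """
    return rng.sample(full_bag, min(bag_size, len(full_bag)))


def softmax_over_bag(scores):
    """Turn per-fragment scores into probabilities normalized over the bag only."""
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {frag: math.exp(s - max_score) for frag, s in scores.items()}
    total = sum(exps.values())
    return {frag: e / total for frag, e in exps.items()}


def select_fragment(bag, score_fn):
    """Return the most probable fragment in the bag and the bag-level probabilities."""
    probs = softmax_over_bag({frag: score_fn(frag) for frag in bag})
    best = max(probs, key=probs.get)
    return best, probs


def toy_score(fragment):
    # Hypothetical stand-in for the network f_theta: longer strings score higher.
    return float(len(fragment))


# Hypothetical full fragment collection F, written as SMILES-like strings.
FULL_BAG = ["c1ccccc1", "C(=O)O", "CCO", "N", "C1CCOC1", "CC(C)C"]

rng = random.Random(0)
bag = sample_fragment_bag(FULL_BAG, bag_size=3, rng=rng)
best, probs = select_fragment(bag, toy_score)
print(bag, best)
```

Because the probabilities are normalized over $\mathcal{B}$ rather than $\mathcal{F}$, the cost of this step in the sketch scales with the bag size rather than with the size of the full fragment collection.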