Learning to generate feasible graphs using graph grammars

Stefan Mautner, Rolf Backofen, Fabrizio Costa

TL;DR

The paper tackles the challenge of generating large graphs with long-range dependencies while respecting domain-specific feasibility constraints. It introduces a graph-grammar–based generative framework that combines a domain-dependent coarsening procedure with constraint integration, enabling short-cuts for long-range interactions and viability guarantees within a Metropolis-Hastings sampling scheme. The approach is demonstrated on RNA secondary structures and small-molecule drugs, achieving 100% viable graphs and competitive MOSES-based quality metrics, while outperforming several neural graph generators in maintaining structural feasibility for large graphs. The results suggest a domain-agnostic yet scalable method for modeling complex dependencies in structured graphs, with future work aimed at learning coarsening strategies from data and reducing computational costs.
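The idea of keeping every sample viable inside a Metropolis-Hastings loop can be illustrated with a small toy sketch. Everything here is an assumption made for illustration: the proposal toggles a single edge (the paper instead uses graph-grammar substitutions), the feasibility test is a maximum-degree bound loosely analogous to a valence limit, and `score` is an invented density. It is not the GraphLearn implementation.

```python
import math
import random

def max_degree_ok(edges, limit=3):
    """Hypothetical feasibility check: no node has more than `limit` edges."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return all(d <= limit for d in degree.values())

def score(edges, target=6):
    """Toy unnormalized density that prefers graphs with about `target` edges."""
    return math.exp(-abs(len(edges) - target))

def propose(edges, n_nodes, rng):
    """Symmetric proposal: toggle one uniformly chosen node pair."""
    u, v = sorted(rng.sample(range(n_nodes), 2))
    return edges ^ frozenset([(u, v)])

def sample_feasible_graphs(n_nodes=6, steps=2000, seed=0):
    rng = random.Random(seed)
    current = frozenset()  # the empty graph is trivially feasible
    samples = []
    for _ in range(steps):
        candidate = propose(current, n_nodes, rng)
        # Infeasible candidates are treated as zero density: always rejected.
        if max_degree_ok(candidate):
            accept_prob = min(1.0, score(candidate) / score(current))
            if rng.random() < accept_prob:
                current = candidate
        samples.append(current)
    return samples

samples = sample_feasible_graphs()
print(all(max_degree_ok(s) for s in samples))  # True
```

Because an infeasible candidate is never accepted and the chain starts from a feasible graph, every state the chain visits satisfies the constraint by construction.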

Abstract

Generative methods for graphs need to be sufficiently flexible to model complex dependencies between sets of nodes. At the same time, the generated graphs need to satisfy domain-dependent feasibility conditions, that is, they should not violate certain constraints that would make their interpretation impossible within the given application domain (e.g. a molecular graph where an atom has a very large number of chemical bonds). Crucially, constraints can involve not only local but also long-range dependencies: for example, the maximal length of a cycle can be bounded. Currently, a large class of generative approaches for graphs, such as methods based on artificial neural networks, is based on message passing schemes. These approaches suffer from information 'dilution' issues that severely limit the maximal range of the dependencies that can be modeled. To address this problem, we propose a generative approach based on the notion of graph grammars. The key novel idea is to introduce a domain-dependent coarsening procedure to provide short-cuts for long-range dependencies. We show the effectiveness of our proposal in two domains: 1) small drugs and 2) RNA secondary structures. In the first case, we compare the quality of the generated molecular graphs via the Molecular Sets (MOSES) benchmark suite, which evaluates the distance between generated and real molecules, their lipophilicity, synthesizability, and drug-likeness. In the second case, we show that the approach can generate very large graphs (with hundreds of nodes) that are accepted as valid examples for a desired RNA family by the "Infernal" covariance model, a state-of-the-art RNA classifier. Our implementation is available on GitHub: github.com/fabriziocosta/GraphLearn
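To make the "short-cut" idea concrete, the following minimal sketch contracts a cycle into a single coarse node and compares hop distances before and after. The graph, the helper names (`hop_distance`, `contract`, `ring_with_tails`) and the choice of which nodes to contract are all illustrative assumptions; the paper's coarsening procedure is domain-dependent and is not reproduced here.

```python
from collections import deque

def hop_distance(adj, source, target):
    """Breadth-first search distance (number of edges) between two nodes."""
    seen = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return seen[node]
        for nbr in adj[node]:
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return None  # target not reachable

def contract(adj, group, new_label):
    """Return a coarsened adjacency dict where `group` becomes one node."""
    group = set(group)

    def rename(n):
        return new_label if n in group else n

    coarse = {}
    for node, nbrs in adj.items():
        coarse.setdefault(rename(node), set())
        for nbr in nbrs:
            if rename(nbr) != rename(node):  # drop edges internal to the group
                coarse[rename(node)].add(rename(nbr))
    return coarse

def ring_with_tails(ring_size=8):
    """A cycle with pendant nodes 'a' and 'b' attached on opposite sides."""
    adj = {i: {(i - 1) % ring_size, (i + 1) % ring_size} for i in range(ring_size)}
    adj["a"], adj["b"] = {0}, {ring_size // 2}
    adj[0].add("a")
    adj[ring_size // 2].add("b")
    return adj

fine = ring_with_tails()
coarse = contract(fine, group=range(8), new_label="cycle")
print(hop_distance(fine, "a", "b"))    # 6
print(hop_distance(coarse, "a", "b"))  # 2
```

In the coarsened graph the two pendant nodes are two hops apart instead of six, so a dependency between them falls within the reach of a small local neighborhood.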

Paper Structure

This paper contains 7 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: A depiction of core and interface subgraphs. Dark shading indicates the core subgraph, while light shading represents the interface. The diagram on the left shows a molecular graph with its core substitution, while the middle and right diagrams depict RNA encodings at different resolutions (nucleotide level and structural-element level, respectively). At the nucleotide level, nodes are labeled with the nucleotide codes A, C, G, U, while at the structural level nodes are labeled as H)airpin, M)ultiloop, S)tem, D)angling end.
  • Figure 2: A molecule and its coarsened version, with $R=0$ (dark green), $B=1$ (left, light green), and $T=1$ (right, light green). Local constraints are satisfied by the original graph on the left, while longer-range constraints (specifically, the fact that the carbon atoms are not only present but form a cycle) are modeled in the coarsened version (right). The coarsening method labels contracted cycles with the hash of the associated subgraph.
  • Figure 3: Examples of the rf01725 family of RNA, exhibiting a large variety of structural configurations.
  • Figure 4: Evaluation of generated RNAs. The Infernal bit score is the expert model's classification score. The horizontal line is the threshold for biological significance associated with this RNA family.
  • Figure 5: Evaluation of generated RNAs. Here we compare the generated sequences to the training material. The generated graphs should differ from the training set, yet induce the same probability density.
  • ...and 1 more figure
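
The core and interface subgraphs mentioned in the captions of Figures 1 and 2 can be sketched as neighborhoods around a root node. This is a hedged illustration only: it assumes the core is the set of nodes within distance `radius` of the root and the interface is the surrounding ring of nodes up to a further `thickness` hops away. The captions do not define $R$, $B$, or $T$, so this reading, along with the function and parameter names, is an assumption.

```python
from collections import deque

def distances_from(adj, root):
    """Breadth-first search distances from `root` to every reachable node."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def core_and_interface(adj, root, radius, thickness):
    """Split the neighborhood of `root` into a core and an interface ring."""
    dist = distances_from(adj, root)
    core = {n for n, d in dist.items() if d <= radius}
    interface = {n for n, d in dist.items() if radius < d <= radius + thickness}
    return core, interface

# Path graph 0-1-2-3-4, rooted at the middle node.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
core, interface = core_and_interface(path, root=2, radius=0, thickness=1)
print(sorted(core), sorted(interface))  # [2] [1, 3]
```

Under this reading, a substitution would swap one core for another only when their interfaces match, which is how the surrounding context constrains what may be inserted.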