Symbolic Neural Generation with Applications to Lead Discovery in Drug Design

Ashwin Srinivasan, A Baskar, Tirtharaj Dash, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee

TL;DR

The paper introduces Symbolic Neural Generators (SNGs), a neurosymbolic framework that couples symbolic learning with neural generation to produce data satisfying verifiable, human-understandable constraints. Its semantics is poset-based: a base poset of symbolic hypotheses and a fibre poset attached to each hypothesis are combined by a Grothendieck construction into a single total poset. This gives a principled space in which to search for triples $(H,X,W)$, where $H$ is a symbolic description of feasible instances, $X$ is a set of neurally generated instances that satisfy $H$, and $W$ is an associated weight. Implemented as GenMol, an ILP-LLM hybrid, the approach is statistically comparable to state-of-the-art methods on well-understood drug targets. On exploratory problems it yields molecules with binding affinities on par with leading clinical candidates, several of which experts judged viable for synthesis and wet-lab testing; the symbolic specifications also serve as preliminary filters for chemists and biologists. The work illustrates the practical value of integrating symbolic reasoning with neural generation in drug design and suggests the approach may carry over to other domains that need constraint-governed data generation with verifiable outputs.

Abstract

We investigate a relatively underexplored class of hybrid neurosymbolic models integrating symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In Symbolic Neural Generators (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a triple $(H, X, W)$, where $H$ is a symbolic description of feasible instances constructed from data, $X$ a set of generated new instances that satisfy the description, and $W$ an associated weight. We introduce a semantics for such systems, based on the construction of appropriate base and fibre partially-ordered sets combined into an overall partial order, and outline a probabilistic extension relevant to practical applications. In this extension, SNGs result from searching over a weighted partial ordering. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.
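
The generate-and-verify loop described in the abstract can be sketched on a toy problem. This is only an illustration under stated assumptions, not the paper's Gen procedure or GenMol: instances are integers, the hypothetical `learn_spec` stands in for the symbolic learner (it infers "divisible by $g$" from the examples), the hypothetical `propose` stands in for the neural generator (a random sampler), and the weight $W$ is taken to be the fraction of proposals that pass verification, which is an assumption made for this sketch.

```python
import math
import random
from functools import reduce


def learn_spec(examples):
    """Toy symbolic learner (assumption): H is 'x is divisible by g',
    where g is the gcd of the given examples. Works with one example."""
    return reduce(math.gcd, examples)


def satisfies(g, x):
    """Symbolic check: does instance x satisfy the specification H?"""
    return x % g == 0


def propose(rng, n, upper=100):
    """Stand-in for a neural generator (assumption): random integers."""
    return [rng.randint(1, upper) for _ in range(n)]


def sng(examples, n_proposals=200, seed=0):
    """Return a triple (H, X, W) for the toy problem."""
    rng = random.Random(seed)
    h = learn_spec(examples)
    proposals = propose(rng, n_proposals)
    # Reject every proposal that violates the symbolic specification.
    passed = [x for x in proposals if satisfies(h, x)]
    # Keep only new instances, i.e. ones not already among the examples.
    x_new = sorted(set(passed) - set(examples))
    # Toy weight (assumption): acceptance rate of the proposals.
    w = len(passed) / n_proposals
    return h, x_new, w


H, X, W = sng([12, 18])
print(H, X[:5], W)
```

Every element of `X` satisfies `H` by construction, because verification is performed by the symbolic side rather than trusted to the generator.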

Paper Structure

This paper contains 24 sections, 5 theorems, 12 equations, 14 figures, 2 algorithms.

Key Result

Proposition 1

$({\cal H},\geq_{\cal H})$ is a partially ordered set.
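
For reference, the claim that $({\cal H},\geq_{\cal H})$ is a partially ordered set amounts to the three standard properties below, for all $H, H_1, H_2, H_3 \in {\cal H}$. The specific relation $\geq_{\cal H}$ is the one given in Definition 2, which is not reproduced on this page; only the generic axioms are shown.

```latex
\begin{align*}
&\text{Reflexivity:}  && H \geq_{\cal H} H \\
&\text{Antisymmetry:} && H_1 \geq_{\cal H} H_2 \ \text{and}\ H_2 \geq_{\cal H} H_1 \implies H_1 = H_2 \\
&\text{Transitivity:} && H_1 \geq_{\cal H} H_2 \ \text{and}\ H_2 \geq_{\cal H} H_3 \implies H_1 \geq_{\cal H} H_3
\end{align*}
```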

Figures (14)

  • Figure 1: (a) Ideally, we would like to generate instances from the set of instances for which $\Phi(x)$ is true; (b) When $\Phi(\cdot)$ is not known, we approximate $\Phi(\cdot)$ by $\Sigma(\cdot)$, obtained using the hypothesis from a symbolic learner. We want to sample instances efficiently from $S$. $N$ is the set of instances obtained from a neural-based generator. For an ideal SNG, $N \subseteq S \subseteq {\cal X}$; (c) In practice, the symbolic learner may not be perfect, and the neural-generator only has an approximate model of the conditional distribution. The set $X$ is the set of instances generated that are in $S$.
  • Figure 5: (a) Posets indexed by elements of a base poset ${\cal H}$. Each element $H$ of the base poset is associated with a fibre-poset $F(H)$. (b) The Grothendieck construction combines the base poset and the fibre-posets to form a single total poset ${\cal F}$, that consists of pairs of elements. An element $(H,X) \in {\cal F}$ is such that $H \in {\cal H}$ and $X \in F(H)$.
  • Figure 6: (a) A position that is "won-for white" (WFW) with "black-to-move" (BTM). Here depth-of-win is zero, i.e., checkmate. There are 27 such positions out of a total of 28,056; (b) A symbolic hypothesis obtained using ILP (adapted from bain:gcws) (rewritten using $\Sigma$ as required). In the context of this paper ${\cal U}$ consists of 6-tuples representing positions of the 3 pieces. The description is a Prolog-like syntax: variables start with upper-case, ":-" stands for $\leftarrow$, and "not" should be read as "not provable". The "diff/3" predicate is defined in the background knowledge and encodes file or rank differences. The "ab" predicates are new relations invented by the ILP system as it attempts to find a logical description for the depth-0 data instances. For example, in the position shown in (a), the invented "ab1" predicate ensures the white rook cannot immediately be taken by the black king. $\Sigma$ should be understood as "WFW".
  • Figure 7: Instances of WFW generated on each iteration of Gen. "Without Symbolic" represents the baseline of the LLM generating instances without any symbolic theory as part of the initial context or for verification (that is, $H = \emptyset$ in Gen). "With Symbolic" provides the WFW theory in Fig. 6(b). "0-shot" means no examples are provided in $E$ (and therefore are not part of the initial context), and "5-shot" means 5 WFW positions are provided in $E$.
  • Figure 8: Statistics of binding affinities (the higher the better) for molecules obtained from GenMol on benchmark datasets. The entries represent the mean values, with standard deviations shown in parentheses. We compare against recent results using LMLF++ [brahmavar2024generating] and prior results using a VAE-GNN model [dash2021using].
  • ...and 9 more figures
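
The construction in Figure 5 can be illustrated on a small example. The sketch below is a toy under stated assumptions, not the paper's definitions: each hypothesis is identified with its extension over a three-element universe, the base ordering is taken to be inclusion of extensions, each fibre $F(H)$ is taken to be the subsets of the extension of $H$ ordered by inclusion, and pairs are compared componentwise. All names (`EXTENSION`, `base_leq`, `fibre`, `total_leq`) are hypothetical.

```python
from itertools import combinations

# Toy base poset (assumption): each hypothesis is identified with its extension.
EXTENSION = {
    "H_general": frozenset({1, 2, 3}),
    "H_medium": frozenset({1, 2}),
    "H_specific": frozenset({1}),
}


def base_leq(h1, h2):
    """Assumed base ordering: h1 <= h2 when h2 covers at least what h1 covers."""
    return EXTENSION[h1] <= EXTENSION[h2]


def fibre(h):
    """Assumed fibre F(h): all subsets of the extension of h."""
    items = sorted(EXTENSION[h])
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def total_poset():
    """Elements of the total poset: pairs (H, X) with X in F(H)."""
    return [(h, x) for h in EXTENSION for x in fibre(h)]


def total_leq(p, q):
    """(h1, x1) <= (h2, x2) iff h1 <= h2 in the base and x1 is a subset of x2."""
    (h1, x1), (h2, x2) = p, q
    return base_leq(h1, h2) and x1 <= x2


def is_partial_order(elements, leq):
    """Brute-force check of reflexivity, antisymmetry and transitivity."""
    reflexive = all(leq(a, a) for a in elements)
    antisymmetric = all(a == b for a in elements for b in elements
                        if leq(a, b) and leq(b, a))
    transitive = all(leq(a, c) for a in elements for b in elements
                     if leq(a, b) for c in elements if leq(b, c))
    return reflexive and antisymmetric and transitive


pairs = total_poset()
print(len(pairs), is_partial_order(pairs, total_leq))
```

Under these assumptions the total poset has $8 + 4 + 2 = 14$ elements, one for each subset in each fibre, and the brute-force check confirms that the componentwise ordering on pairs is itself a partial order.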

Theorems & Definitions (26)

  • Definition 1: Extension of $H$
  • Definition 2: Ordering over ${\cal H}$
  • Proposition 1
  • Definition 3: Fibred Poset of a base element
  • Definition 4: Total Poset
  • Definition 5: Probabilistic Extension
  • Definition 6: Symbolic Neural Generator
  • Definition 7: Non-vacuous extension of $H$
  • Proposition 2: Correctness of Gen
  • Example 1: Factors, Experiments, Hypotheses
  • ...and 16 more