Symbolic Neural Generation with Applications to Lead Discovery in Drug Design
Ashwin Srinivasan, A Baskar, Tirtharaj Dash, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee
TL;DR
The paper introduces Symbolic Neural Generators (SNGs), a neurosymbolic framework that couples symbolic learning with neural generation to produce data that satisfy verifiable, human-understandable constraints. Using a poset-based semantics built from a base of symbolic hypotheses and neural fibres, augmented by a Grothendieck construction, SNG provides a principled space to search for $(H,X,W)$ triplets where $H$ encodes symbolic rules, $X$ are neural-generated samples, and $W$ weights their joint plausibility. Implemented as GenMol, an ILP-LLM hybrid, the approach shows competitive performance on well-understood drug targets and yields novel, synthesizable leads in exploratory problems, with symbolic specifications that help filter and interpret results for chemists and biologists. The work demonstrates the practical value of integrating symbolic reasoning with neural generation in drug design and suggests broad applicability to other domains requiring constraint-governed data generation and verifiable outputs.
Abstract
We investigate a relatively underexplored class of hybrid neurosymbolic models integrating symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In \textit{Symbolic Neural Generators} (SNGs), symbolic learners construct logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a triple $(H, X, W)$, where $H$ is a symbolic description of feasible instances constructed from data, $X$ is a set of newly generated instances that satisfy the description, and $W$ is an associated weight. We introduce a semantics for such systems, based on the construction of appropriate \textit{base} and \textit{fibre} partially-ordered sets combined into an overall partial order, and outline a probabilistic extension relevant to practical applications. In this extension, SNGs result from searching over a weighted partial ordering. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is in the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.
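As a rough illustration of the generate-and-reject idea described in the abstract, the following minimal sketch builds a triple $(H, X, W)$ from a toy specification. Everything in it is an assumption made for illustration: the `Spec` class, the two feature names, the random `toy_generator` standing in for the LLM, and the choice of acceptance rate as the weight $W$ are hypothetical and are not taken from the paper. The sketch covers only the rejection step; it does not model how a specification conditions the generator, nor how ILP learns the specification.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy feature-vector stand-in for a molecule (hypothetical representation).
Instance = Dict[str, float]
Constraint = Callable[[Instance], bool]


@dataclass
class Spec:
    """Hypothetical symbolic description H: a conjunction of named constraints."""
    constraints: Dict[str, Constraint]

    def satisfied_by(self, x: Instance) -> bool:
        # An instance is feasible only if every constraint holds.
        return all(check(x) for check in self.constraints.values())


def toy_generator(rng: random.Random) -> Instance:
    """Stand-in for the neural generator; it simply samples features at random."""
    return {
        "mol_weight": rng.uniform(100.0, 700.0),
        "ring_count": float(rng.randint(0, 6)),
    }


def generate(spec: Spec, n_proposals: int, seed: int = 0) -> Tuple[Spec, List[Instance], float]:
    """Propose instances, keep those satisfying the spec, and return (H, X, W)."""
    rng = random.Random(seed)
    proposals = [toy_generator(rng) for _ in range(n_proposals)]
    accepted = [x for x in proposals if spec.satisfied_by(x)]
    # W is the acceptance rate here: an illustrative choice, not the paper's weight.
    weight = len(accepted) / n_proposals if n_proposals else 0.0
    return spec, accepted, weight


# Hand-written constraints; in an SNG these would be learned from data.
spec = Spec({
    "weight_below_500": lambda x: x["mol_weight"] < 500.0,
    "has_ring": lambda x: x["ring_count"] >= 1,
})
H, X, W = generate(spec, n_proposals=200)
```

By construction, every element of `X` satisfies `H`, which mirrors the correctness criterion in the abstract: the generator's output is filtered so that no returned instance violates the symbolic specification.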
