Cycle-Configuration: A Novel Graph-theoretic Descriptor Set for Molecular Inference

Bowen Song, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

TL;DR

The paper introduces cycle-configuration (CC), a family of graph-theoretic descriptors that extends the standard two-layered (2L) model of mol-infer so that ortho/meta/para patterns around cycles can be distinguished. CC is integrated into a 2L+CC model with a corresponding MILP formulation, supporting both ML prediction and inverse design of chemical graphs with up to 50 non-hydrogen atoms. In the experiments, prediction functions built with CC descriptors perform similarly to or better than those built with 2L alone on all 27 tested properties, and the MILP infers feasible chemical graphs in practical time. This work broadens the applicability of MILP-based molecular inference and sets the stage for extensions to polymers and multi-objective designs.

Abstract

In this paper, we propose a novel family of descriptors of chemical graphs, named cycle-configuration (CC), that can be used in the standard "two-layered (2L) model" of mol-infer, a molecular inference framework based on mixed integer linear programming (MILP) and machine learning (ML). The proposed descriptors capture the notion of ortho/meta/para patterns that appear in aromatic rings, which has not been possible in the framework so far. Computational experiments show that, when the new descriptors are supplied, we can construct prediction functions of similar or better performance for all of the 27 tested chemical properties. We also provide an MILP formulation that asks for a chemical graph with desired properties under the 2L model with CC descriptors (2L+CC model). We show that a chemical graph with up to 50 non-hydrogen vertices can be inferred in practical time.

Paper Structure

This paper contains 27 sections, 47 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (a) the chemical graph ${\mathbb C}_0$ for catechol; (b) the chemical graph ${\mathbb C}_1$ for resorcinol; (c) the chemical graph ${\mathbb C}_2$ for hydroquinone; and (d) two fringe-trees $\psi_1$ and $\psi_2$ that appear in all of ${\mathbb C}_0$, ${\mathbb C}_1$ and ${\mathbb C}_2$. In (b), the edge-configuration of the interior-edge indicated by a dotted rectangle is $({\mathtt C}2,{\mathtt C}3,2)$. Although $f_{\textrm{2L}}({\mathbb C}_1)=f_{\textrm{2L}}({\mathbb C}_2)$, $a({\mathbb C}_1)=0\ne1=a({\mathbb C}_2)$ holds in the data set for the AhR property from the Tox21 collection.
  • Figure 2: Construction of a chemical graph. (a) A seed tree. Thick squares/lines indicate ring nodes/edges, while thin circles/lines indicate non-ring nodes/edges. (b) Ring nodes are expanded to chordless 6-cycles. (c) Fringe-trees are assigned to every vertex and bond-multiplicities are assigned to every edge. Fringe-trees of non-zero height are indicated by shading. The PubChem CID of the compound is 156839899, and the molecular formula is C$_{35}$H$_{51}$N$_9$O$_8$.
  • Figure 3: Seed trees for the inference experiments: All nodes are ring nodes. A ring edge (resp., a non-ring edge) is depicted by a thick (resp., thin) line.
  • Figure 4: Inferred chemical graphs
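
The situation in the caption of Figure 1 can be illustrated with a toy calculation. The sketch below is not the paper's definition of $f_{\textrm{2L}}$ or of the CC descriptors. It assumes a single 6-cycle of carbons with two substituents, ignores bond multiplicities, and uses hypothetical helper names (`edge_degree_multiset`, `ring_pattern`). It counts ring edges by the degrees of their endpoints, in the spirit of the edge-configuration $({\mathtt C}2,{\mathtt C}3,2)$, and shows that such edge-local counts coincide for the meta and para arrangements, whereas the distance between the substituents along the cycle separates all three.

```python
from collections import Counter

RING_SIZE = 6  # assumption: one chordless 6-cycle, as in benzenediols


def ring_degrees(substituted):
    """Degree of each ring carbon: 2 ring neighbours, plus 1 per substituent."""
    return [2 + (i in substituted) for i in range(RING_SIZE)]


def edge_degree_multiset(substituted):
    """Multiset of (deg, deg) pairs over ring edges (bond orders ignored).

    A simplified stand-in for edge-local information; hypothetical helper,
    not the paper's f_2L.
    """
    deg = ring_degrees(substituted)
    return Counter(
        tuple(sorted((deg[i], deg[(i + 1) % RING_SIZE])))
        for i in range(RING_SIZE)
    )


def ring_pattern(substituted):
    """Classify two substituent positions by their distance along the cycle."""
    a, b = sorted(substituted)
    d = min(b - a, RING_SIZE - (b - a))
    return {1: "ortho", 2: "meta", 3: "para"}[d]


catechol = {0, 1}      # ortho
resorcinol = {0, 2}    # meta
hydroquinone = {0, 3}  # para

for name, pos in [("catechol", catechol),
                  ("resorcinol", resorcinol),
                  ("hydroquinone", hydroquinone)]:
    print(name, ring_pattern(pos), dict(edge_degree_multiset(pos)))
```

In this toy setting resorcinol and hydroquinone produce the same edge multiset while catechol does not, which mirrors the statement that ${\mathbb C}_1$ and ${\mathbb C}_2$ share a 2L feature vector. Any descriptor that depends on positions around the cycle, as the CC descriptors are described to do, can tell them apart.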