On Finding All Connected Maximum-Sized Common Subgraphs in Multiple Labeled Graphs

Johannes B. S. Petersen, Akbar Davoodi, Thomas Gärtner, Marc Hellmuth, Daniel Merkle

TL;DR

This work tackles the problem of finding all maximum-vertex and maximum-edge common subgraphs (MVCS and MECS) across multiple labeled graphs, including connected variants, which is particularly relevant for bioinformatics and cheminformatics. It introduces an exact framework built on modular product graphs and a Bron-Kerbosch-based enumeration to list all maximal cliques corresponding to MVCS/MECS, augmented with pruning and a graph-ordering heuristic derived from graph-kernel and minmax similarities. The method extends to vertex- and edge-labeled graphs and to type-A connected cliques, with careful handling of Δ-Y ambiguities in line graphs via an inverse mapping; it also provides pruning strategies and parallelizable components to improve practicality. Empirical evaluation on large molecular datasets demonstrates scalability and speedups from the proposed ordering and pruning techniques, and an open-source implementation is made available for reproducibility and broader adoption.

Abstract

We present an exact algorithm for computing all common subgraphs with the maximum number of vertices across multiple graphs. Our approach is further extended to handle the connected Maximum Common Subgraph (MCS), identifying the largest common subgraph in terms of either vertices or edges across multiple graphs, where edges or vertices may additionally be labeled to account for possible atom types or bond types, a classical labeling used in molecular graphs. Our approach leverages modular product graphs and a modified Bron-Kerbosch algorithm to enumerate maximal cliques, ensuring all intermediate solutions are retained. A pruning heuristic efficiently reduces the modular product size, improving computational feasibility. Additionally, we introduce a graph ordering strategy based on graph-kernel similarity measures to optimize the search process. Our method is particularly relevant for bioinformatics and cheminformatics, where identifying conserved structural motifs in molecular graphs is crucial. Empirical results on molecular datasets demonstrate that our approach is scalable and fast.

Paper Structure

This paper contains 13 sections, 5 theorems, 4 figures.

Key Result

Lemma

Let $G_1,\dots, G_n$ be graphs. If $H$ is a maximal common induced subgraph of $G_1,\dots,G_n$, then there exists a maximal clique $K$ in $\star_{i=1}^n G_i$ of size $|V(K)| = |V(H)|$ and for which the projection $p_i$ onto the $i$-th factor satisfies $p_i(K) = H_i\simeq H$.
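The correspondence in the lemma can be illustrated with a minimal sketch for the simplest setting: two unlabeled graphs and the standard modular product, in which two product vertices are adjacent when both coordinates differ and the two coordinate pairs are either both edges or both non-edges. This is an assumption about the definition, not the authors' implementation, and it omits labels, pruning, ordering, and connectivity constraints. The function names and the toy graphs `P4` and `C4` are hypothetical.

```python
from itertools import combinations


def modular_product(g, h):
    """Modular product of two simple undirected graphs.

    g and h map each vertex to the set of its neighbours. Product vertices
    (u1, v1) and (u2, v2) are adjacent iff u1 != u2, v1 != v2, and u1u2 is
    an edge of g exactly when v1v2 is an edge of h.
    """
    nodes = [(u, v) for u in g for v in h]
    adj = {node: set() for node in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 != u2 and v1 != v2 and (u2 in g[u1]) == (v2 in h[v1]):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def maximal_cliques(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every inclusion-maximal clique (Bron-Kerbosch with pivoting)."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r
        return
    # Pivot on the vertex with the most neighbours among the candidates.
    pivot = max(p | x, key=lambda n: len(adj[n] & p))
    for n in p - adj[pivot]:
        yield from maximal_cliques(adj, r | {n}, p & adj[n], x & adj[n])
        p, x = p - {n}, x | {n}


def is_induced_match(clique, g, h):
    """Check that the pairing stored in `clique` is an isomorphism between
    the subgraphs of g and h induced by its two projections."""
    return all(
        u1 != u2 and v1 != v2 and (u2 in g[u1]) == (v2 in h[v1])
        for (u1, v1), (u2, v2) in combinations(clique, 2)
    )


# Hypothetical toy input: a path and a cycle on four vertices each.
P4 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
C4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

product = modular_product(P4, C4)
cliques = list(maximal_cliques(product))
largest = max(len(k) for k in cliques)
maximum_cliques = [k for k in cliques if len(k) == largest]
print(largest, len(maximum_cliques))
```

For these two graphs the largest cliques have three vertices, matching the induced path on three vertices that both graphs share. Each clique is a set of pairs, so projecting onto the first or second coordinate yields the vertex sets of the matched subgraphs. The lemma only guarantees that every maximal common induced subgraph appears as a maximal clique; as Figure 4 shows, a maximal clique need not correspond to a maximal common subgraph.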

Figures (4)

  • Figure 1: Plots showing the relationship between the four normalized similarity measures VH, WL, NSPD, and minmax (x-axis) and the values $y_\ell$ (y-axis), which reflect the average number of type-A connected common subgraphs of two graphs; see the text for further details.
  • Figure 2: Box plot showing the effect of different techniques applied to the input graphs (instances of 5 molecular graphs with 35 atoms each) on the runtime for finding MECS. In particular, we applied greedy ordering based on the four similarity measures VH, WL, NSPD, and minmax, together with the computation of maximal type-A connected cliques with and without the removal of type-0 edges (see the Methods section). The term $Z\in \{\text{VH, WL, NSPD, minmax}\}$ on the $x$-axis refers to the application of measure $Z$ without this refinement step, while $Z^R$ means that the removal of certain type-0 edges has been applied. The dotted lines represent the mean values across all instances. Without greedy ordering, the runtime in many cases exceeded one hour, if the computation finished at all.
  • Figure 3: An example instance with five different molecular graphs from the ChEMBL22 database with 35 non-hydrogen vertices each. There are exactly two MECS with 6 edges each. A single occurrence of the two MECS within each graph is highlighted in red and green, respectively. Note that more occurrences are possible.
  • Figure 4: By way of example, the product $G \star H$ contains the clique $K = \{(1,5),(2,3), (4,4),(5,1),(3,2)\}\subseteq V(G \star H)$. This clique $K$ is inclusion-maximal. To see this, observe that any additional vertex $(x,y)$ we may add to $K$ to obtain a larger clique must satisfy $x,y\notin \{1,2,3,4,5\}$, as neither $G$ nor $H$ contains edges $(1,1), \dots, (5,5)$. Hence, only $x,y\in \{6,7\}$ is possible. Since $5$ is adjacent to $7$ and $3$ is adjacent to $6$, but $1$ is adjacent to neither $7$ nor $6$ in $G$ and $H$, it follows that the subgraph induced by $V(K)\cup \{(x,y)\}$ with $x,y\in \{6,7\}$ cannot form a clique in $G \star H$. Since $K$ corresponds to the subgraph induced by $\{1,2,3,4,5\}$ and since $G\simeq H$, it follows that $K$ does not correspond to a maximal common subgraph of $G$ and $H$.

Theorems & Definitions (8)

  • Lemma
  • Proposition
  • Lemma
  • Proof
  • Lemma
  • Proof
  • Proposition
  • Proof