Instruction Set and Language for Symbolic Regression

Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez

Abstract

A fundamental but largely unaddressed obstacle in symbolic regression (SR) is structural redundancy: every expression DAG admits many distinct node-numbering schemes that all encode the same expression, each occupying a separate point in the search space and consuming fitness evaluations without adding diversity. We present IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string -- a complete labeled-DAG isomorphism invariant -- that collapses all equivalent representations into a single canonical form.

Paper Structure (53 sections, 11 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: String-to-DAG execution for the canonical string VcVspv*pv+PpcnnC. The DAG grows incrementally as each instruction inserts labeled nodes (V/v) or directed edges (C/c), with primary ($\pi$) and secondary ($\sigma$) pointers navigating the circular doubly linked list.
  • Figure 2: DAG-to-String encoding of $\sin(x_0) \cdot x_1 + \cos(x_0)$ via greedy traversal from $x_0$. Ghost nodes (dashed) indicate parts not yet encoded; the emitted token sequence converges to the canonical string as all nodes and edges are visited.
  • Figure 3: Round-trip property for the Nguyen-1 benchmark ($x_0^3 + x_0^2 + x_0$). Each row applies a different DAG-to-String (D2S) algorithm, followed by String-to-DAG (S2D) reconstruction. Columns 0, 6: original and reconstructed DAGs (isomorphic, $\cong$). Columns 1--2: D2S progressively encodes the DAG; ghost (dashed) nodes and edges indicate parts not yet encoded, while the instruction string builds up left-to-right. Column 3: the complete instruction string ($w^*$ for canonical, $w$ for greedy). Columns 4--5: S2D rebuilds the DAG from the string. The canonical and pruned algorithms produce identical strings ($|w^*|{=}19$, $>$99.97% agreement), while the greedy algorithm yields a longer encoding ($|w|{=}23$, $+$21%), yet all three round-trip to isomorphic DAGs.
  • Figure 4: Property P3 (canonical invariance and idempotence) for $\sin(x_0) + \cos(x_0)$. The original DAG $D$ is canonicalized to produce $w^{**} = \texttt{VcVspv+Ppc}$, then decoded via S2D to reconstruct $D"$, which is structurally isomorphic to $D$ (invariance: $D \cong D"$). Applying canonicalization a second time yields $w^{**'} = w^{**}$ (idempotence), confirming that the canonical string is a fixed point of the compose-and-recanonicalize map. Colour-coded token blocks show the instruction string at each transformation step.
  • Figure 5: Shortest Levenshtein path between $\cos(x_0) + x_0$ (V+VcPnc) and $\cos(x_0) + 1$ (VcVkpv+Ppc) in the canonical string space ($d_{\mathrm{Lev}} = 6$). Each step applies one character-level edit (substitution or insertion) to the current string. All intermediate strings produce valid expression DAGs, progressing through simplified $\cos(x_0)$ forms---including a $\cos(\cos(x_0))$ detour at Step 4---before reaching the target. Below each DAG: the corresponding IsalSR instruction string (colour-coded by token type) and the mathematical expression.
  • ...and 2 more figures
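
The distance quoted in the Figure 5 caption can be checked with a standard character-level Levenshtein computation. The following is a minimal sketch, not the authors' implementation: the function name `levenshtein` is hypothetical, and it measures raw string distance only, without checking that intermediate strings decode to valid expression DAGs.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit-cost insert, delete, substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


# The two canonical strings named in the Figure 5 caption.
source = "V+VcPnc"      # cos(x_0) + x_0
target = "VcVkpv+Ppc"   # cos(x_0) + 1
distance = levenshtein(source, target)
print(distance)  # 6, matching d_Lev in the caption
```

The target is three characters longer than the source, so any shortest path needs at least three insertions. The remaining three edits are substitutions, which is consistent with the caption's statement that each step is a substitution or an insertion.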

Theorems & Definitions (11)

  • Definition 2.1: Labeled DAG
  • Definition 2.2: IsalSR Instruction Set $\Sigma_{\mathrm{SR}}$
  • Definition 2.3: S2D Execution
  • Definition 2.4: Spiral Displacement Set
  • Definition 2.5: Valid String Set
  • Definition 2.6: Canonical String
  • Definition 2.7: 6-Component Structural Tuple
  • Definition 2.8: Pruned Canonical String
  • Definition 2.9: Labeled-DAG Isomorphism
  • Conjecture 2.10: Round-Trip Fidelity
  • ...and 1 more