Prove Symbolic Regression is NP-hard by Symbol Graph

Jinglu Song, Qiang Lu, Bozhou Tian, Jingwen Zhang, Jake Luo, Zhiguang Wang

TL;DR

The paper argues that symbolic regression (SR) is NP-hard by introducing a symbol graph that represents the entire mathematical expression space $\Omega$ and establishing a connection between SR and the degree-constrained Steiner Arborescence problem (DCSAP). It first shows that DCSAP is NP-hard (and that its decision version, DCSAP-Dec, is NP-complete), then shows that SR-Dec reduces to DCSAP-Dec within the symbol graph, from which it concludes that SR-Dec is NP-complete and SR is NP-hard. Unlike earlier analyses restricted to linear forms, the framework covers rich, non-linear expressions. The result justifies the prevalence of approximation methods in SR and ties the complexity of SR to a well-studied combinatorial optimization problem, with potential implications for algorithm design and theoretical analysis. The mapping from $\Omega$ to the symbol graph, and from SR to DCSAP, underscores the fundamental difficulty of discovering expressive symbolic forms from data.

Abstract

Symbolic regression (SR) is the task of discovering a symbolic expression that fits a given data set from the space of mathematical expressions. Despite the abundance of research surrounding the SR problem, there's a scarcity of works that confirm its NP-hard nature. Therefore, this paper introduces the concept of a symbol graph as a comprehensive representation of the entire mathematical expression space, effectively illustrating the NP-hard characteristics of the SR problem. Leveraging the symbol graph, we establish a connection between the SR problem and the task of identifying an optimally fitted degree-constrained Steiner Arborescence (DCSAP). The complexity of DCSAP, which is proven to be NP-hard, directly implies the NP-hard nature of the SR problem.
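To make "a symbolic expression that fits a given data set" concrete, the following is a minimal sketch of how a single candidate expression could be represented as a tree and scored against data. The `Node` class, the `OPS` table, and the use of mean squared error are illustrative assumptions, not constructions taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical operator table: symbol -> (arity, implementation).
OPS = {
    "+": (2, lambda a, b: a + b),
    "*": (2, lambda a, b: a * b),
    "sin": (1, math.sin),
}


@dataclass(frozen=True)
class Node:
    """One node of an expression tree."""
    symbol: str                      # operator, variable name, or "const"
    children: Tuple["Node", ...] = ()
    value: float = 0.0               # only read when symbol == "const"


def evaluate(node: Node, env: Dict[str, float]) -> float:
    """Recursively evaluate the expression for one assignment of variables."""
    if node.symbol == "const":
        return node.value
    if node.symbol in OPS:
        arity, fn = OPS[node.symbol]
        if len(node.children) != arity:
            raise ValueError(f"{node.symbol!r} expects {arity} operand(s)")
        return fn(*(evaluate(child, env) for child in node.children))
    return env[node.symbol]          # otherwise treat the symbol as a variable


def mse(node: Node, data: List[Tuple[Dict[str, float], float]]) -> float:
    """Mean squared error of the expression over (inputs, target) pairs."""
    return sum((evaluate(node, x) - y) ** 2 for x, y in data) / len(data)


# Candidate expression x*x + 1, scored on samples drawn from the same function.
x = Node("x")
candidate = Node("+", (Node("*", (x, x)), Node("const", value=1.0)))
data = [({"x": float(v)}, float(v * v + 1)) for v in range(4)]
print(mse(candidate, data))  # 0.0, since the candidate matches the data exactly
```

Scoring one candidate is cheap; the difficulty the paper addresses lies in searching the space of all such trees for the best-fitting one.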

Paper Structure

This paper contains 7 sections, 3 theorems, 3 equations, and 4 figures.

Key Result

Lemma 1

The DCSAP problem is NP-hard.
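As a rough illustration of what the decision version asks, the sketch below checks a proposed solution in polynomial time, which is the property that places a decision problem in NP. It assumes one common formulation that may differ from the paper's Definition 1: a directed weighted graph, a root, a terminal set, a bound on each vertex's out-degree, and a weight budget. The function name `is_valid_certificate` and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def is_valid_certificate(
    weights: Dict[Edge, float],   # directed edges of G with their weights
    root: str,
    terminals: Set[str],
    tree_edges: Iterable[Edge],   # the proposed arborescence
    max_out_degree: int,          # assumed form of the degree constraint
    budget: float,                # weight threshold of the decision version
) -> bool:
    edges = set(tree_edges)
    if any(e not in weights for e in edges):
        return False              # uses an edge that is not in the graph

    children = defaultdict(list)
    in_degree = defaultdict(int)
    for u, v in edges:
        children[u].append(v)
        in_degree[v] += 1

    # The root has no parent; every vertex with an incoming edge has exactly one.
    if in_degree.get(root, 0) != 0 or any(d != 1 for d in in_degree.values()):
        return False

    # Every vertex used by the certificate must be reachable from the root.
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    used = {root} | {u for u, _ in edges} | {v for _, v in edges}
    if seen != used or not set(terminals) <= seen:
        return False

    if any(len(c) > max_out_degree for c in children.values()):
        return False
    return sum(weights[e] for e in edges) <= budget


graph = {("a", "b"): 1, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 5, ("a", "d"): 4}
tree = [("a", "b"), ("b", "d"), ("a", "c")]
print(is_valid_certificate(graph, "a", {"c", "d"}, tree, max_out_degree=2, budget=4))
```

Each check runs in time linear in the number of proposed edges, so verification is easy even though, per Lemma 1, finding an optimal arborescence is NP-hard.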

Figures (4)

  • Figure 1: Symbol Graph.
  • Figure 2: An example of the DCSTP and the DCSAP. (a) shows an undirected graph $G=(V,E,w)$ with terminals $S=\{V_a,V_c,V_e,V_f,V_g\}$; the DCSTP solution is drawn in black lines and has a weight of 13. (b) shows a directed graph with the same terminals and the root vertex $r=V_a$.
  • Figure 3: Examples of computing weights. (a) shows the weights on edges $E_c$ and $E_x$; (b) and (c) show two substeps of calculating the weights on edge $E_{op}$.
  • Figure 4: An example symbol graph $G$ for the SR problem. The tree drawn in orange lines is the DCSAP solution in $G$.

Theorems & Definitions (7)

  • Lemma 1
  • Proof 1
  • Definition 1: DCSAP-Dec
  • Lemma 2
  • Proof 2
  • Theorem 1
  • Proof 3