Prove Symbolic Regression is NP-hard by Symbol Graph
Jinglu Song, Qiang Lu, Bozhou Tian, Jingwen Zhang, Jake Luo, Zhiguang Wang
TL;DR
The paper argues that symbolic regression (SR) is NP-hard. It introduces a symbol graph that represents the entire mathematical expression space $\Omega$ and uses it to relate SR to the degree-constrained Steiner Arborescence problem (DCSAP). It first confirms that DCSAP is NP-hard (and that its decision version, DCSAP-Dec, is NP-complete), then ties the decision version of SR (SR-Dec) to DCSAP-Dec within the symbol graph and concludes that SR-Dec is NP-complete and SR is NP-hard. Unlike earlier analyses restricted to linear forms, the framework covers rich, non-linear expressions. The result gives a complexity-theoretic justification for the prevalence of approximation methods in SR. Tying SR over $\Omega$ to a well-studied combinatorial optimization problem also offers a unified view of its complexity, with potential implications for algorithm design and theoretical analysis.
Abstract
Symbolic regression (SR) is the task of discovering a symbolic expression that fits a given data set from the space of mathematical expressions. Despite the abundance of research on the SR problem, few works confirm that it is NP-hard. This paper therefore introduces the symbol graph, a comprehensive representation of the entire mathematical expression space that makes the NP-hard character of the SR problem explicit. Leveraging the symbol graph, we establish a connection between the SR problem and the task of identifying an optimally fitted degree-constrained Steiner Arborescence, i.e., the degree-constrained Steiner Arborescence problem (DCSAP). Because DCSAP is proven to be NP-hard, this connection directly implies that the SR problem is NP-hard.
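
To make the search space concrete, the following is a minimal Python sketch of SR as a brute-force search over expression trees. It is an illustration only, not the paper's symbol-graph construction or its reduction. The symbol set `SYMBOLS`, the nested-tuple tree encoding, and the helpers `enumerate_trees`, `respects_arity`, `evaluate`, `mse`, `best_fit`, and `count_trees` are all hypothetical names chosen for this example. Each expression is a rooted tree in which a node has exactly as many children as its symbol's arity, which is a loose analogue of the degree constraint mentioned above.

```python
import math
from itertools import product

# Hypothetical symbol set (not taken from the paper): name -> (arity, function).
SYMBOLS = {
    "+": (2, lambda a, b: a + b),
    "*": (2, lambda a, b: a * b),
    "sin": (1, math.sin),
    "x": (0, None),  # input variable
    "1": (0, None),  # constant one
}


def enumerate_trees(depth):
    """Yield every expression tree of height <= depth as a nested tuple."""
    for name, (arity, _) in SYMBOLS.items():
        if arity == 0:
            yield (name,)
        elif depth > 0:
            children = list(enumerate_trees(depth - 1))
            for combo in product(children, repeat=arity):
                yield (name, *combo)


def respects_arity(tree):
    """Check that every node has exactly as many children as its arity."""
    name, *kids = tree
    return len(kids) == SYMBOLS[name][0] and all(respects_arity(k) for k in kids)


def evaluate(tree, x):
    """Evaluate an expression tree at the input value x."""
    name, *kids = tree
    if name == "x":
        return x
    if name == "1":
        return 1.0
    return SYMBOLS[name][1](*(evaluate(k, x) for k in kids))


def mse(tree, data):
    """Mean squared error of the tree on (x, y) pairs."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)


def best_fit(data, depth):
    """Exhaustively pick a tree of height <= depth with the lowest error."""
    return min(enumerate_trees(depth), key=lambda t: mse(t, data))


def count_trees(depth):
    """Number of candidate trees of height <= depth."""
    return sum(1 for _ in enumerate_trees(depth))


if __name__ == "__main__":
    data = [(x, x * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
    print(best_fit(data, depth=2))
    print([count_trees(d) for d in range(4)])
```

With only five symbols, the candidate count in this sketch grows from 2 trees at height 0 to 12, 302, and 182712 at heights 1, 2, and 3. Exhaustive search therefore stops being practical almost immediately, which is consistent with the practical reliance on approximation methods in SR. The sketch shows the growth of the search space only; it does not demonstrate NP-hardness.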
