Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression

Paul Saegert, Ullrich Köthe

TL;DR

This work tackles the slowdown of amortized symbolic regression caused by costly normalization and simplification steps. It introduces SimpliPy, a fast, pattern-based simplification engine, and Flash-ANSR, a Transformer-based framework for on-the-fly data generation and scalable inference, enabling large-scale training and improved recovery on the FastSRB benchmark. SimpliPy delivers speedups of up to roughly 100x over SymPy at comparable simplification quality, which makes training on hundreds of millions of expression-data pairs and rigorous evaluation practical. On realistic scientific data, Flash-ANSR achieves recovery competitive with state-of-the-art GP methods (PySR) while producing more concise expressions as the inference budget increases; the results also underline the importance of rigorous decontamination and evaluation protocols. The main limitation is limited robustness to noisy data, which motivates future work on incorporating noise into training and broadening data distributions and encodings.

Abstract

Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has so far addressed this with general-purpose Computer Algebra Systems (CAS) like SymPy, but their high computational cost severely limits training and inference speed. We propose SimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash-ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise, rather than more complex, expressions as the inference budget increases.
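
To make the idea of rule-based reduction to a normalized form concrete, here is a minimal sketch of a bottom-up rewriting pass over expression trees. It is not SimpliPy's API or rule set: the nested-tuple representation, the function names `simplify` and `length`, and the handful of identities are assumptions chosen purely for illustration.

```python
# Toy rule-based simplifier (illustrative only, NOT SimpliPy's actual API).
# Expressions are nested tuples: (operator, operand, ...); leaves are
# variable names (str) or numeric constants.

def simplify(expr):
    """Simplify children first, then apply a few local rewrite rules."""
    if not isinstance(expr, tuple):
        return expr  # leaf: nothing to rewrite
    op, *args = expr
    args = [simplify(a) for a in args]
    if op == "+":
        a, b = args
        if a == 0: return b                      # 0 + b -> b
        if b == 0: return a                      # a + 0 -> a
        if a == b: return simplify(("*", 2, a))  # a + a -> 2 * a
    if op == "*":
        a, b = args
        if a == 0 or b == 0: return 0            # 0 * b -> 0
        if a == 1: return b                      # 1 * b -> b
        if b == 1: return a                      # a * 1 -> a
    if op == "-":
        a, b = args
        if a == b: return 0                      # a - a -> 0
        if b == 0: return a                      # a - 0 -> a
    if op == "/":
        a, b = args
        if a == b: return 1                      # a / a -> 1 (ignores a == 0)
        if b == 1: return a                      # a / 1 -> a
    return (op, *args)                           # no rule matched

def length(expr):
    """Number of nodes (operators plus leaves) in the expression tree."""
    if not isinstance(expr, tuple):
        return 1
    return 1 + sum(length(a) for a in expr[1:])

# x * 1 + (y - y) reduces to the single leaf x.
expr = ("+", ("*", "x", 1), ("-", "y", "y"))
print(simplify(expr), length(expr), length(simplify(expr)))  # x 7 1
```

Because equivalent inputs such as `("*", 1, "x")` and `("+", "x", 0)` both reduce to `"x"`, comparing simplified forms gives a cheap equivalence check, which is the property that decontamination against test expressions relies on. A real engine needs a far larger rule set and a canonical operand ordering, both of which this sketch omits.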

Paper Structure

This paper contains 47 sections, 18 equations, 18 figures, 6 tables, 2 algorithms.

Figures (18)

  • Figure 1: The Flash-ANSR training pipeline. Following the established standard encoder-decoder paradigm, our framework integrates SimpliPy (top center) into the loop for synchronous simplification of on-the-fly generated training expressions.
  • Figure 2: Left: Validation Numeric Recovery Rate (vNRR) as a function of inference time (log scale). Flash-ANSR models (shades of blue) scale monotonically with compute, with the 120M model partially surpassing the PySR baseline (red). Baselines NeSymReS and E2E fail to generalize to the benchmark. Right: Expression Length Ratio $|\hat{\bm{\tau}}|/|\bm{\tau}|$ versus compute. We observe a parsimony inversion: while PySR increases complexity to minimize error over time, Flash-ANSR converges toward simpler, more canonical expressions as the sampling budget increases. Shaded regions denote 95% confidence intervals.
  • Figure 3: Flash-ANSR fits to its own and PySR's scaling curves $\log_{10}(T)$ vs vNRR from Figure 2 (left) using v23.0-120M, $\gamma = 0.15$, 128k choices $\approx$ 10 min. Extrapolation suggests an asymptotic vNRR $\propto \log T$ scaling for Flash-ANSR, and an asymptotic upper limit for PySR around $53\%$.
  • Figure 4: Left: Empirical Cumulative Distribution Functions (ECDFs) of simplification wall-clock time. Our SimpliPy rewriting engine (shades of blue, varying $L_{\max}$) operates in the low to moderate millisecond regime, orders of magnitude faster than the SymPy baseline (orange, red). Right: ECDF of the simplification ratio $|\bm{\tau^*}| / |\bm{\tau}|$. The inset highlights the tail of the distribution. Our method with $L_{\max} \ge 5$ achieves simplification ratios comparable to the SymPy baseline while maintaining high throughput.
  • Figure 5: Left: Validation Numeric Recovery Rate (vNRR) vs. the number of support points $M$. Flash-ANSR (120M) outperforms PySR in the dense data regime ($M > 64$). Right: Expression Length Ratio $|\hat{\bm{\tau}}|/|\bm{\tau}|$. We observe a distinct "Complexity Peak" at $M \approx 8$, where the model generates expressions significantly longer than the ground truth. This peak coincides with a regime of high uncertainty (low log-probability) and excess constant usage, suggesting the model is interpolating the sparse points via complex aliasing rather than identifying the underlying law.
  • ...and 13 more figures
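
The captions above report length ratios such as $|\hat{\bm{\tau}}|/|\bm{\tau}|$ and summarize distributions as ECDFs. The following is a minimal sketch of how such quantities could be computed. It assumes that $|\cdot|$ counts the tokens of an expression in prefix notation; the helper names `length_ratio` and `ecdf` and the sample values are hypothetical and not taken from the paper.

```python
from bisect import bisect_right

def length_ratio(pred_tokens, true_tokens):
    """Ratio of predicted to ground-truth expression length (token counts)."""
    return len(pred_tokens) / len(true_tokens)

def ecdf(samples):
    """Return F with F(x) = fraction of samples that are <= x."""
    xs = sorted(samples)
    n = len(xs)
    def F(x):
        # bisect_right counts how many sorted samples are <= x
        return bisect_right(xs, x) / n
    return F

# Hypothetical example: 2 * x predicted for a ground truth of x + x.
print(length_ratio(["*", "2", "x"], ["+", "x", "x"]))  # 1.0

# Made-up ratios, only to show how the ECDF is evaluated.
F = ecdf([0.5, 1.0, 1.0, 2.0])
print(F(1.0))  # 0.75
```

A ratio below 1 means the predicted (or simplified) expression is shorter than the reference, so an ECDF that rises early indicates more concise outputs.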