SURFACEBENCH: Can Self-Evolving LLMs Find the Equations of 3D Scientific Surfaces?

Sanchit Kabra, Shobhnik Kriplani, Parshin Shojaee, Chandan K. Reddy

TL;DR

SurfaceBench introduces the first comprehensive benchmark for symbolic surface discovery, targeting 3D surfaces described by explicit, implicit, and parametric equations across 15 scientific domains. It pairs symbolic fidelity with geometry-aware evaluation via Chamfer and Hausdorff distances to measure functional equivalence in object space rather than text similarity. The study evaluates a broad range of baselines, including LLM-guided and non-LLM symbolic regression methods, and finds limited generalization across representations and surface complexities, with recovery rates around 4–6%. By releasing the dataset and evaluation pipeline, SurfaceBench provides a principled platform to advance compositional and geometry-aware reasoning in LLM-based scientific discovery.

Abstract

Equation discovery from data is a core challenge in machine learning for science, requiring the recovery of concise symbolic expressions that govern complex physical and geometric phenomena. Recent approaches with large language models (LLMs) show promise in symbolic regression, but their success often hinges on memorized formulas or overly simplified functional forms. Existing benchmarks exacerbate this limitation: they focus on scalar functions, ignore domain grounding, and rely on brittle string-matching metrics that fail to capture scientific equivalence. We introduce SurfaceBench, the first comprehensive benchmark for symbolic surface discovery. SurfaceBench comprises 183 tasks across 15 categories of symbolic complexity, spanning explicit, implicit, and parametric equation representations. Each task includes ground-truth equations, variable semantics, and synthetically sampled three-dimensional data. Unlike prior symbolic regression (SR) datasets, our tasks reflect surface-level structure, resist LLM memorization through novel symbolic compositions, and are grounded in scientific domains such as fluid dynamics, robotics, electromagnetics, and geometry. To evaluate equation discovery quality, we pair symbolic checks with geometry-aware metrics such as Chamfer and Hausdorff distances, capturing both algebraic fidelity and spatial reconstruction accuracy. Our experiments reveal that state-of-the-art frameworks, while occasionally successful on specific families, struggle to generalize across representation types and surface complexities. SurfaceBench thus establishes a challenging and diagnostic testbed that bridges symbolic reasoning with geometric reconstruction, enabling principled benchmarking of progress in compositional generalization, data-driven scientific induction, and geometry-aware reasoning with LLMs. We release the code here: https://github.com/Sanchit-404/surfacebench
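To make the geometry-aware metrics concrete, the following is a minimal sketch of Chamfer and Hausdorff distances between two sampled point clouds. It is not taken from the SurfaceBench code: the function names are hypothetical, and the Chamfer variant shown (sum of the two mean nearest-neighbor Euclidean distances) is one common convention among several; the benchmark may use a squared or differently normalized form.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every point in a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def chamfer_distance(a, b):
    """Assumed convention: mean nearest-neighbor distance a->b plus b->a."""
    d = pairwise_distances(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hausdorff_distance(a, b):
    """Symmetric Hausdorff: the worst nearest-neighbor distance in either direction."""
    d = pairwise_distances(a, b)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Tiny illustration: two points, and the same two points shifted by 2 along z.
truth = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = truth + np.array([0.0, 0.0, 2.0])

chamfer_same = chamfer_distance(truth, truth)
chamfer_shift = chamfer_distance(truth, shifted)
hausdorff_shift = hausdorff_distance(truth, shifted)
```

Because both metrics compare sampled points rather than expression strings, two algebraically different but equivalent equations would score as identical, which is the motivation the abstract gives for using them. Chamfer averages over all points, while Hausdorff reports only the single worst mismatch and is therefore more sensitive to localized errors.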

Paper Structure

This paper contains 25 sections, 2 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: SurfaceBench: A benchmark suite for symbolic regression featuring 183 surface equations spanning 15 scientific domains. The benchmark covers three canonical equation representations: explicit (red), implicit (blue), and parametric (yellow), thus illustrating diverse surface structures and symbolic challenges.
  • Figure 2: Dataset curation pipeline for SurfaceBench, ensuring a diverse set of seed equations, their transformation to discourage memorization, and rigorous validation through novelty and solvability checks.
  • Figure 3: The SurfaceBench evaluation pipeline integrates symbolic and geometric metrics to assess equation recovery quality. Given sampled 3D surface data, self-evolving LLM frameworks generate candidate symbolic expressions. These predictions are compared against the ground truth using three complementary evaluation modes: regression-style errors (NMSE), symbolic accuracy (via equivalence checks), and geometry-aware distance metrics, namely Chamfer and Hausdorff distances.
  • Figure 4: Noise sensitivity analysis across Chamfer Distance, Hausdorff Distance, and NMSE. Lower values indicate better performance.
  • Figure 6: Failure modes of two LLM-based symbolic regression methods: LLM-SR and OpenEvolve. We identify two modes of error: (i) search space errors, where the frameworks fail to find the correct functional families, and (ii) equation fitting errors, where the frameworks find the correct functional families but are unable to optimize the resulting equation.
  • ...and 6 more figures
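
The regression-style error named in the Figure 3 and Figure 4 captions can be sketched as follows. This is an illustration under stated assumptions rather than the benchmark's implementation: NMSE is assumed here to be mean squared error divided by the variance of the ground-truth values, and both function names, along with the relative noise model, are hypothetical.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Assumed definition: MSE normalized by the variance of the ground truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return mse / np.var(y_true)

def add_gaussian_noise(y, relative_sigma, seed=0):
    """Hypothetical noise model: Gaussian noise scaled by the spread of y."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    return y + rng.normal(0.0, relative_sigma * np.std(y), size=y.shape)

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nmse(y, y)                          # exact recovery
baseline = nmse(y, np.full_like(y, y.mean())) # always predicting the mean
noisy = add_gaussian_noise(y, relative_sigma=0.1)
```

Under this normalization a perfect prediction scores 0 and a constant prediction at the mean scores 1, so values are comparable across surfaces with different output scales.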