ViSymRe: Vision Multimodal Symbolic Regression

Da Li, Junping Yin, Jin Xu, Xinxin Li, Juan Zhang

TL;DR

ViSymRe is a Transformer-based Vision Symbolic Regression framework. It uses Multi-View Random Slicing to visualize high-dimensional equations in 2D and fuses the resulting visual features with dataset information through a dual-vision pipeline. A Visual Decoder enables dataset-only inference by predicting discrete visual features via a codebook, while a Biased Cross-Attention module suppresses noise from the virtual vision during fusion. The model is trained with a multi-objective loss and uses a syntax-constrained autoregressive decoder to improve interpretability and robustness, achieving strong performance on low-dimensional and complex SR benchmarks with efficient inference. AMS and robustness ablations indicate good generalization across scales and noise conditions, suggesting ViSymRe's potential for rapid scientific discovery. Overall, ViSymRe achieves competitive SR accuracy with considerably lower complexity, particularly in low-complexity scenarios, and offers a scalable path toward dataset-only deployment in multimodal SR settings.

Abstract

Extracting interpretable equations from observational datasets to describe complex natural phenomena is one of the core goals of artificial intelligence. This field is known as symbolic regression (SR). In recent years, Transformer-based paradigms have become a new trend in SR, addressing the well-known problem of inefficient search. However, the modal heterogeneity between datasets and equations often hinders the convergence and generalization of these models. In this paper, we propose ViSymRe, a Vision Symbolic Regression framework, to explore the positive role of the visual modality in enhancing the performance of Transformer-based SR paradigms. To overcome the challenge that visual SR models are untrainable in high-dimensional scenarios, we present Multi-View Random Slicing (MVRS). By projecting multivariate equations into 2-D space using random affine transformations, MVRS avoids common defects in high-dimensional visualization, such as variable degradation, missing non-linear interactions, and exponentially increasing sampling complexity, enabling ViSymRe to be trained at low computational cost. To support dataset-only deployment of ViSymRe, we design a dual-vision pipeline architecture based on generative techniques. It reconstructs visual features directly from the datasets via an auxiliary Visual Decoder and automatically suppresses the attention weights of reconstruction noise through a proposed Biased Cross-Attention feature fusion module, ensuring that subsequent processes are not affected by noisy modalities. Ablation studies demonstrate the positive contribution of the visual modality to model convergence and to various SR metrics. Furthermore, evaluation results on mainstream benchmarks indicate that ViSymRe achieves competitive performance compared to baselines, particularly in low-complexity and rapid-inference scenarios.

Paper Structure

This paper contains 36 sections, 2 theorems, 31 equations, 17 figures, 9 tables, 3 algorithms.

Key Result

Theorem 1

(Non-degeneracy of affine transformation). $\forall i \in \{1, \dots, d\}$, the projection $x_i\circ \Psi$ constitutes a non-trivial function $\psi_i(s, t)$ almost surely.
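
One way to read Theorem 1 is through a small numerical sketch. The code below is an illustration under stated assumptions, not the paper's construction: it assumes $\Psi(s, t) = A\,[s, t]^\top + c$ with the entries of $A$ and $c$ drawn from a standard Gaussian, and the helper names (`random_affine_slice`, `apply_slice`, `is_nondegenerate`) are hypothetical. Under that assumption, $x_i \circ \Psi$ is constant in $(s, t)$ only if row $i$ of $A$ is exactly zero, which has probability zero for a continuous distribution.

```python
import numpy as np

def random_affine_slice(d, rng):
    """Draw a hypothetical affine map Psi(s, t) = A @ [s, t] + c into R^d.

    The Gaussian sampling is an assumption made for this sketch only.
    """
    A = rng.standard_normal((d, 2))
    c = rng.standard_normal(d)
    return A, c

def apply_slice(A, c, s, t):
    """Map arrays of (s, t) coordinates to points in R^d."""
    st = np.stack([s, t], axis=-1)   # shape (..., 2)
    return st @ A.T + c              # shape (..., d)

def is_nondegenerate(A, tol=1e-12):
    """True if every coordinate x_i o Psi depends on (s, t), i.e. no row of A is zero."""
    return bool(np.all(np.linalg.norm(A, axis=1) > tol))

rng = np.random.default_rng(0)
A, c = random_affine_slice(5, rng)

# Evaluate the slice on a 16 x 16 grid in the (s, t) plane.
s, t = np.meshgrid(np.linspace(-1, 1, 16), np.linspace(-1, 1, 16))
X = apply_slice(A, c, s, t)          # shape (16, 16, 5)

# A hand-built degenerate case: the first variable is frozen along the slice.
A_bad = A.copy()
A_bad[0] = 0.0
```

Here `is_nondegenerate(A)` holds for the random draw, while `is_nondegenerate(A_bad)` does not, which mirrors the kind of variable degradation the theorem rules out almost surely.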

Figures (17)

  • Figure 1: Overview of the ViSymRe framework. During data preprocessing, the inputs, consisting of MVRS slices and randomly sampled datasets, are generated. The Visual Encoder and Dataset Encoder extract visual features ($F_v$) and dataset features ($F_s$) from these inputs, respectively. $F_v$ is then quantized into discrete representations (denoted $\widetilde{F}_v$) via a Codebook. $\widetilde{F}_v$ and $F_s$ are fused through a standard Cross-Attention module, and the fused features are decoded into the target equation skeleton. Because MVRS slices are unavailable during inference, a Visual Decoder is integrated into ViSymRe and learns to predict visual features (called virtual vision and denoted $\hat{F}_v$) conditioned on dataset embeddings. This establishes an independent virtual visual pipeline that shares the same decoder with the real visual pipeline but uses a Biased Cross-Attention feature fusion module to suppress the attention weights of predicted noise.
  • Figure 2: Comparison of four visualization methods for the interaction term $x_1(x_2 - x_3)$. The proposed method (a) preserves the saddle shape via random isotropic slicing. Marginal projection (b) suffers from feature collapse, degenerating the non-linear interaction into a misleading linear response. Traditional methods (c and d): PCA exhibits information loss due to projection overlap, while t-SNE fails to reconstruct the manifold continuity, resulting in topological tearing even with dense sampling. Notably, MVRS requires only $N \cdot d$ points, while PCA and t-SNE require $N^d$ points. This computational burden is a key reason why traditional methods cannot support large-scale pre-training.
  • Figure 3: Overview of the Biased Cross-Attention module. $Norm$ denotes $L_2$ regularization. Only the top regions of the Bias and Attention matrices are shown for better visibility. Observe that the majority of the regions exhibit low attention scores. This phenomenon is intuitive, as it implies that each point attends to only a single or a limited number of visual features.
  • Figure 4: The Symbolic Solution Rate results on low-dimensional benchmarks. Missing results denote a Symbolic Solution Rate of 0.
  • Figure 5: Robustness of models under varying noise levels. Results are averaged over 8 low-dimensional benchmarks at three noise intensities (0, 0.01, and 0.1).
  • ...and 12 more figures
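
The claims in the Figure 2 caption can be checked with a small sketch. It is an illustration only: the Gaussian sampling of the slice, the grid size, and the values of `N` and `d` are assumptions, and the helper names are hypothetical. Restricted to a plane $x = A\,[s, t]^\top + c$, the term $x_1(x_2 - x_3)$ becomes a product of two affine functions of $(s, t)$, whose Hessian $u v^\top + v u^\top$ has determinant $-(u_1 v_2 - u_2 v_1)^2 \le 0$. The slice is therefore a saddle whenever $u$ and $v$ are not parallel.

```python
import numpy as np

def interaction(x):
    """f(x1, x2, x3) = x1 * (x2 - x3), the interaction term from Figure 2."""
    return x[..., 0] * (x[..., 1] - x[..., 2])

def slice_hessian(A):
    """Hessian in (s, t) of f(A @ [s, t] + c); constant because f is quadratic."""
    u = A[0]           # linear part of x1 along the slice
    v = A[1] - A[2]    # linear part of (x2 - x3) along the slice
    return np.outer(u, v) + np.outer(v, u)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # assumed Gaussian slice directions
c = rng.standard_normal(3)        # assumed Gaussian offset

# Evaluate the slice on a 32 x 32 grid of (s, t) values.
s, t = np.meshgrid(np.linspace(-1, 1, 32), np.linspace(-1, 1, 32))
Z = interaction(np.stack([s, t], axis=-1) @ A.T + c)   # shape (32, 32)

# A negative determinant means curvatures of opposite sign: a saddle.
hessian_det = np.linalg.det(slice_hessian(A))

# Point counts quoted in the caption, for illustrative values of N and d.
N, d = 100, 3
mvrs_points = N * d      # 300
dense_points = N ** d    # 1000000
```

With these illustrative values the dense grid needs over three thousand times as many points as the sliced sampling, and the gap widens exponentially as `d` grows.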

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof