Automated Model Discovery via Multi-modal & Multi-step Pipeline

Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin, Junhyun Nam, Tae-Hyun Oh

TL;DR

The paper tackles automated model discovery in large, complex search spaces by introducing a multi-modal, multi-step pipeline powered by two vision-language-based agents: AnalyzerVLM for iterative analysis and candidate-model proposal, and EvaluatorVLM for perception-based model evaluation via the Visual Information Criterion (VIC). VIC combines a visual fitness/generalizability score with the traditional BIC to better capture structural consistency and extrapolation behavior, thereby guiding model selection toward both local accuracy and broader generalization. The approach is demonstrated on Gaussian process kernel discovery and extended to symbolic regression, with extensive ablations showing the crucial roles of modality, reasoning steps, and visual evaluation. Results indicate improved generalization and robustness over traditional kernel-search methods and other LLM/VLM-based baselines, highlighting practical potential for automated, interpretable model discovery in real-world datasets. The work also discusses limitations and avenues for extending the framework to multivariate data and improved visualization-driven evaluation.
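To make the idea of combining a visual score with BIC concrete, here is a minimal Python sketch. The BIC formula is the standard one (k ln n minus twice the maximized log-likelihood, lower is better). Everything else is an assumption for illustration only: the summary does not state how VIC weights its terms, so `combined_score`, its `weight` parameter, the candidate names, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def bic(log_likelihood: float, num_params: int, num_points: int) -> float:
    """Standard Bayesian Information Criterion: k * ln(n) - 2 * ln(L_hat)."""
    return num_params * math.log(num_points) - 2.0 * log_likelihood


def combined_score(bic_value: float, visual_score: float, weight: float = 1.0) -> float:
    """Hypothetical combination (NOT the paper's VIC definition).

    A higher visual score is treated as better, so it is subtracted from
    BIC, where lower is better. The linear form and `weight` are assumptions.
    """
    return bic_value - weight * visual_score


# Made-up candidates: fitted log-likelihood, parameter count, and a visual
# score such as an evaluator might assign. None of these numbers are real.
candidates = {
    "LIN": {"log_lik": -120.0, "k": 2, "visual": 10.0},
    "LIN + PER": {"log_lik": -95.0, "k": 5, "visual": 40.0},
}
num_points = 100

scores = {
    name: combined_score(bic(c["log_lik"], c["k"], num_points), c["visual"])
    for name, c in candidates.items()
}

# Select the candidate with the lowest combined score.
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))
```

In this toy setup the richer candidate pays a larger complexity penalty through `k`, but its better likelihood and visual score outweigh that penalty. Changing `weight` shifts how much the perceptual judgment influences selection relative to BIC.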

Abstract

Automated model discovery is the process of automatically searching and identifying the most appropriate model for a given dataset over a large combinatorial search space. Existing approaches, however, often face challenges in balancing the capture of fine-grained details with ensuring generalizability beyond training data regimes with a reasonable model complexity. In this paper, we present a multi-modal & multi-step pipeline for effective automated model discovery. Our approach leverages two vision-language-based modules (VLM), AnalyzerVLM and EvaluatorVLM, for effective model proposal and evaluation in an agentic way. AnalyzerVLM autonomously plans and executes multi-step analyses to propose effective candidate models. EvaluatorVLM assesses the candidate models both quantitatively and perceptually, regarding the fitness for local details and the generalizability for overall trends. Our results demonstrate that our pipeline effectively discovers models that capture fine details and ensure strong generalizability. Additionally, extensive ablation studies show that both multi-modality and multi-step reasoning play crucial roles in discovering favorable models.

Paper Structure

This paper contains 41 sections, 18 equations, 21 figures, 8 tables, 2 algorithms.

Figures (21)

  • Figure 1: Overall pipeline. Our multi-modal & multi-step pipeline contains four stages: model proposal, model fitting, model evaluation, and model selection. During model proposal, AnalyzerVLM repeatedly analyzes the data until it determines the results are sufficient. It then proposes a model structure with linearity (LIN) and periodicity (PER), after which model fitting is conducted. In the model evaluation stage, EvaluatorVLM assesses models visually, and the resulting score is used in the subsequent model selection process.
  • Figure 2: Qualitative results. The graph on the left illustrates the data, where a black line stands for the observed data and a red line for the test data. The graphs on the right show the predictions of each method at the shaded region of the left graph, with the blue lines representing their outputs. The results highlight the generalization capabilities and efficient search performance of our method.
  • Figure 2: Ablation study on VLM-based modules in our pipeline. We conduct an ablation study of AnalyzerVLM and EvaluatorVLM in terms of MSE. The last row stands for our complete pipeline. Bold represents the best, and underline is the second best.
  • Figure 3: MSE of text-only and multi-modal representations. We restrict the use of visual representations during the model generation and evaluation processes. The blue value indicates the reduction in MSE when using visual representations.
  • Figure 4: Number of steps required by LLM and VLM. When an LLM is used instead of a VLM for the Analyzer, i.e., without visualizing the data and model, the Analyzer requires more steps to validate its analysis.
  • ...and 16 more figures