Agentic-imodels: Evolving agentic interpretability tools via autoresearch

Chandan Singh, Yan Shuo Tan, Weijia Xu, Zelalem Gero, Weiwei Yang, Michel Galley, Jianfeng Gao

Abstract

Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable by humans, rather than by agents. To address this, we introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. Specifically, it develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric. The metric scores a suite of LLM-graded tests that probe whether a fitted model's string representation is "simulatable" by an LLM, i.e. whether the LLM can answer questions about the model's behavior by reading its string output alone. We find that the evolved models jointly improve predictive performance and agent-facing interpretability, generalizing to new datasets and new interpretability tests. Furthermore, these evolved models improve downstream end-to-end ADS, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.
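To make the simulatability metric concrete, the following is a minimal sketch of a single test, assuming a generic `llm(prompt) -> str` callable (a placeholder for any chat-completion API) and a hand-built string rendering of a Ridge fit; the paper's actual harness, prompts, and grading rules may differ.

```python
# Minimal sketch of one LLM simulatability test (hypothetical harness;
# `llm(prompt) -> str` is a placeholder for any chat-completion API).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Ridge().fit(X, y)

# The LLM sees ONLY a string rendering of the fitted model. Plain
# scikit-learn reprs omit fitted coefficients, so we spell them out here;
# the evolved models expose such information via __str__ directly.
model_str = "y = {:.3f} + ".format(model.intercept_) + " + ".join(
    "{:.3f}*x{}".format(c, i) for i, c in enumerate(model.coef_)
)

x_query = [2.18, 0.0, 0.0]
truth = model.predict(np.array([x_query]))[0]  # ground truth: evaluate the fit

prompt = (
    f"Fitted model:\n{model_str}\n\n"
    f"What does the model predict for x = {x_query}? "
    "Answer with a single number."
)

def grade(answer: str, truth: float, rel_tol: float = 0.05) -> bool:
    """Pass iff the LLM's numeric answer is within a relative tolerance."""
    try:
        return abs(float(answer.strip()) - truth) <= rel_tol * max(1.0, abs(truth))
    except ValueError:
        return False

# The agent interpretability score is the fraction of such tests passed:
# score = mean(grade(llm(p), t) for (p, t) in test_suite)
```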

Paper Structure

This paper contains 49 sections, 3 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: (a) Overview of the Agentic-imodels autoresearch loop, which optimizes a Python class for predictive performance and agent interpretability (evaluated through LLM-based simulatability tests). (b) The discovered Agentic-imodels (blue points) improve the Pareto frontier of predictive performance and interpretability over baselines from the literature. See evaluation details in \ref{sec:main_results}.
  • Figure 2: The interpretability test protocol, illustrated on Ridge regression with four of the 43 tests. Synthetic data is generated and the model is fit. The LLM receives only the __str__ output and a query. The response is graded against the ground truth, which is obtained either from evaluating the fitted function or from knowledge of the ground-truth function. In some cases, the question itself requires evaluating the prediction function, e.g. in the counterfactual target test the value 8.50 is obtained by evaluating the model when setting $x_0 = 2.18, x_1 = 0.0, x_2 = 0.0$. Here the LLM passes three tests but fails the counterfactual target test. Even though the information to compute the counterfactual target is present in the __str__ output, the representation does not make it easy for the LLM to solve the inverse problem compared to other model representations (e.g. a decision tree that makes the predicted value easily apparent in leaf nodes). A hypothetical code sketch of such an agent-readable representation follows this figure list.
  • Figure 3: Agentic-imodels versus baselines (gray crosses) in terms of both predictive performance (the RMSE mean rank: each model's mean rank is computed across datasets, then normalized to $[0, 1]$ with lower being better) and agent interpretability scores (fraction of tests passed from the 157-test held-out generalization suite (\ref{tab:gen_tests})). Across different settings, Agentic-imodels achieve Pareto improvements in terms of predictive performance and interpretability: (a) Claude Opus-4 models across three effort levels, (b) Codex (GPT-5.3) models across three effort levels, (c) Claude Opus-4.7 at medium effort with 3 random repetitions. (d) Agent interpretability scores on the development set of tests versus the held-out set for four of the runs shown in (a)/(b) (using matching colors/markers). Some models exhibit held-out agent interpretability scores significantly below their development scores, suggesting potential reward hacking. Excluding the points in this shaded region, the remaining points show a strong positive correlation ($r=0.65$).
  • Figure 4: Including Agentic-imodels improves performance on the BLADE benchmark across 4 ADS agents: GitHub Copilot CLI (gemini-2.5-pro), GitHub Copilot CLI (sonnet-4.5), Claude Code (sonnet-4.6), and Codex CLI (GPT-5.3). (a) Aggregate scores across the 13 BLADE datasets, with four prompt conditions per agent: standard tools (no explicit package emphasis), prompt emphasizing the imodels package, prompt emphasizing the interpretML package, and prompt emphasizing our Agentic-imodels package. Including Agentic-imodels yields substantial improvements over all the other conditions, across different agents and evaluation axes, particularly for the weaker ADS systems where the margin for improvement is larger. (b) Per-dataset average score (mean of correctness, completeness, clarity) with performance using standard tools versus Agentic-imodels; points above the diagonal indicate improvement. Error bars show standard error of the mean across agent seeds and judge seeds (9 evaluations per dataset per condition: 3 agent runs $\times$ 3 judge runs).
  • Figure B1: Post-hoc evaluator-sensitivity analysis on the held-out interpretability tests (41 baseline + evolved models from the Claude Opus-4.6 medium-effort run). Top row: GPT-5.4 evaluator (same prompt template as the in-loop GPT-4o evaluator). Bottom row: Claude Haiku-4.5 evaluator with a slightly perturbed prompt. (a, c) Normalized prediction rank on the 65 original training datasets (y-axis matching \ref{fig:analysis}a--c) versus agent interpretability score under each evaluator; baselines are colored crosses by category and evolved models are blue. The evolved Pareto frontier is preserved under both evaluators. (b, d) Per-model agent interpretability score under each held-out evaluator (x-axis) versus the original GPT-4o score (y-axis); baselines are gray, evolved models are blue. Scores correlate strongly with GPT-4o under both evaluators ($r = 0.83$ for GPT-5.4, $r = 0.85$ for Haiku), but with opposite biases: GPT-5.4 is the stricter evaluator (most points above the diagonal), whereas Haiku is more lenient (most points below the diagonal).
  • ...and 2 more figures
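To make Figure 2's point about representations concrete, below is a hypothetical sketch of a scikit-learn-compatible regressor whose __str__ spells out the full prediction rule, so that an LLM can both simulate predictions and invert them (as in the counterfactual target test). The class name and rule form are illustrative only, not one of the evolved Agentic-imodels.

```python
# Hypothetical toy model in the spirit of Figure 2: the __str__ output
# makes predictions easy to read off (the evolved models are richer).
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class ThresholdRegressor(BaseEstimator, RegressorMixin):
    """Toy rule model: predicts one of two constants based on feature 0."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        self.threshold_ = float(np.median(X[:, 0]))   # simple split point
        below = X[:, 0] <= self.threshold_
        self.low_ = float(y[below].mean())
        self.high_ = float(y[~below].mean()) if (~below).any() else self.low_
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(X[:, 0] <= self.threshold_, self.low_, self.high_)

    def __str__(self):
        # The entire prediction function is legible: an LLM can simulate
        # outputs and also solve inverse queries (which input yields which
        # output) by inspection, unlike a bare coefficient vector.
        return (
            f"if x0 <= {self.threshold_:.3f}: predict {self.low_:.3f}\n"
            f"else: predict {self.high_:.3f}"
        )
```

An LLM reading str(model) here can answer a counterfactual-target query directly from the two leaf values, whereas recovering the same answer from a linear model's __str__ requires solving an inverse problem, which is the failure mode Figure 2 illustrates.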