Utilizing Large Language Models for Machine Learning Explainability
Alexandros Vassiliades, Nikolaos Polatidis, Stamatios Samaras, Sotiris Diplaris, Ignacio Cabrera Martin, Yannis Manolopoulos, Stefanos Vrochidis, Ioannis Kompatsiaris
TL;DR
This study investigates whether large language models (LLMs) can autonomously generate ML pipelines that are both effective and explainable. It prompts three LLMs to produce training pipelines for four classifiers on two tasks (binary driver alertness and multilabel yeast localization) and evaluates performance alongside SHAP-based explanations. Results show near-perfect performance on the binary task with highly faithful and concise SHAP explanations. Multilabel results vary more in accuracy but still yield faithful explanations with moderate sparsity, and Claude and DeepSeek often outperform GPT in that setting. The findings suggest that LLM-driven pipeline generation can yield interpretable AI solutions. Limitations include the reliance on SHAP as the sole explainability method and the variability introduced by prompting, which point to future work on broader tasks and on causal and fairness-aware evaluation.
Abstract
This study explores the explainability capabilities of large language models (LLMs) when employed to autonomously generate machine learning (ML) solutions. We examine two classification tasks: (i) a binary classification problem focused on predicting driver alertness states, and (ii) a multilabel classification problem based on the yeast dataset. Three state-of-the-art LLMs (namely OpenAI GPT, Anthropic Claude, and DeepSeek) are prompted to design training pipelines for four common classifiers: Random Forest, XGBoost, Multilayer Perceptron, and Long Short-Term Memory networks. The generated models are evaluated in terms of predictive performance (recall, precision, and F1-score) and explainability using SHAP (SHapley Additive exPlanations). Specifically, we measure Average SHAP Fidelity (Mean Squared Error between SHAP approximations and model outputs) and Average SHAP Sparsity (number of features deemed influential). The results show that LLMs can produce effective, interpretable models with high fidelity and consistent sparsity, closely matching manually engineered baselines, which highlights their potential as automated tools for interpretable ML pipeline generation.
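The two explainability metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how SHAP values are obtained or when a feature counts as "influential", so the function names, the `threshold` parameter, and the toy linear model below are all assumptions for illustration, not the authors' implementation. For a linear model with independent features, the Shapley attribution of feature i has the closed form w_i * (x_i - mean(x_i)), which lets the example run without any external library.

```python
# Sketch only: names, threshold, and the toy model are illustrative assumptions.

def shap_fidelity(shap_values, base_value, model_outputs):
    """MSE between the SHAP reconstruction (base + sum of attributions)
    and the model's actual output, averaged over samples."""
    errors = [
        (base_value + sum(phi) - y) ** 2
        for phi, y in zip(shap_values, model_outputs)
    ]
    return sum(errors) / len(errors)


def shap_sparsity(shap_values, threshold=0.01):
    """Average number of features per sample whose absolute attribution
    exceeds `threshold` (an assumed cutoff for 'influential')."""
    counts = [sum(1 for v in phi if abs(v) > threshold) for phi in shap_values]
    return sum(counts) / len(counts)


# Toy linear model f(x) = w . x + b (hypothetical weights and data).
weights = [2.0, 0.0, -1.0]
bias = 0.5
X = [[1.0, 5.0, 2.0], [3.0, 1.0, 0.0], [2.0, 3.0, 4.0]]


def predict(x):
    return sum(w * v for w, v in zip(weights, x)) + bias


# Feature means serve as the background; base value is the model output there.
means = [sum(col) / len(col) for col in zip(*X)]
base_value = predict(means)

# Closed-form Shapley values for a linear model with independent features.
shap_values = [[w * (v - m) for w, v, m in zip(weights, x, means)] for x in X]
outputs = [predict(x) for x in X]

fidelity = shap_fidelity(shap_values, base_value, outputs)
sparsity = shap_sparsity(shap_values)
print(f"fidelity (MSE): {fidelity:.6f}, sparsity: {sparsity:.3f}")
```

Lower fidelity values mean the attributions reconstruct the model output more closely, and lower sparsity values mean fewer features carry the explanation. In this toy case the reconstruction is exact because the attributions are analytic; with an approximate explainer on a nonlinear model the error would generally be nonzero.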
