VirtualXAI: A User-Centric Framework for Explainability Assessment Leveraging GPT-Generated Personas

Georgios Makridis, Vasileios Koukos, Georgios Fatouros, Dimosthenis Kyriazis

TL;DR

The paper tackles the challenge of evaluating explainability in AI by combining quantitative benchmarks with qualitative, user-centered insights. It introduces an XAI scoring framework that fuses fidelity, simplicity, stability, and accuracy metrics with LLM-generated virtual personas and a content-based recommender to tailor method and dataset choices for tabular data. The three core contributions are a tabular-data XAI scoring framework, an LLM-based qualitative assessment procedure, and a content-based recommender that maps dataset characteristics to historical benchmarks for domain-specific recommendations. The approach shows that domain context matters for XAI method selection and emphasizes user-perceived interpretability alongside technical metrics, supporting more effective human-AI collaboration and practical XAI deployment.

Abstract

In today's data-driven era, computational systems generate vast amounts of data that drive the digital transformation of industries, where Artificial Intelligence (AI) plays a key role. Demand for eXplainable AI (XAI) has increased as a way to enhance the interpretability, transparency, and trustworthiness of AI models. However, evaluating XAI methods remains challenging: existing evaluation frameworks typically focus on quantitative properties such as fidelity, consistency, and stability without taking into account qualitative characteristics such as satisfaction and interpretability. In addition, practitioners lack guidance in selecting appropriate datasets, AI models, and XAI methods, which is a major hurdle in human-AI collaboration. To address these gaps, we propose a framework that integrates quantitative benchmarking with qualitative user assessments through virtual personas based on the "Anthology" of Large Language Model (LLM) backstories. Our framework also incorporates a content-based recommender system that leverages dataset-specific characteristics to match new input data with a repository of benchmarked datasets. This yields an estimated XAI score and provides tailored recommendations for both the optimal AI model and the XAI method for a given scenario.
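
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the meta-feature vectors, the cosine-similarity measure, the similarity-weighted score estimate, and all names (`BenchmarkedDataset`, `recommend`, `repository_demo`) and numeric values below are assumptions made up for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class BenchmarkedDataset:
    """One entry of a hypothetical benchmark repository."""
    name: str
    features: list          # assumed non-negative, pre-scaled dataset meta-features
    best_model: str         # best-performing AI model recorded for this dataset
    best_xai_method: str    # best-scoring XAI method recorded for this dataset
    xai_score: float        # benchmarked XAI score (illustrative value)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(new_features, repository, k=2):
    """Match a new dataset to its k most similar benchmarked datasets.

    Returns an estimated XAI score (similarity-weighted mean over the k
    neighbours) and the model / XAI method of the single closest dataset.
    """
    scored = [(max(0.0, cosine_similarity(new_features, d.features)), d)
              for d in repository]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]

    total = sum(sim for sim, _ in top)
    if total > 0:
        estimate = sum(sim * d.xai_score for sim, d in top) / total
    else:
        # No informative neighbour: fall back to a plain average.
        estimate = sum(d.xai_score for _, d in top) / len(top)

    closest = top[0][1]
    return {
        "closest_dataset": closest.name,
        "estimated_xai_score": estimate,
        "model": closest.best_model,
        "xai_method": closest.best_xai_method,
    }


# Made-up repository entries, for demonstration only.
repository_demo = [
    BenchmarkedDataset("clinical_demo", [0.9, 0.1, 0.5], "random_forest", "SHAP", 0.80),
    BenchmarkedDataset("finance_demo", [0.1, 0.9, 0.5], "gradient_boosting", "LIME", 0.60),
]

if __name__ == "__main__":
    print(recommend([0.8, 0.2, 0.5], repository_demo, k=2))
```

In this sketch the recommendation is taken from the single nearest dataset while the score estimate blends several neighbours; a real system could just as well vote over neighbours or use a different distance.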

Paper Structure

This paper contains 16 sections, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Integrated system architecture combining quantitative evaluation, qualitative assessment, and survey-informed insights. The recommender system uses these inputs to generate an XAI score and recommend AI and XAI methods based on the dataset’s characteristics and user preferences.
  • Figure 2: High-level pipeline for the quantitative evaluation of XAI methods. Data is ingested and preprocessed, then a trained AI model is explained using SHAP, LIME, PFI, or PDP.
  • Figure 3: Overview of the qualitative assessment approach. Virtual personas are generated, and their feedback on the XAI outputs is collected through structured surveys.
  • Figure 4: Number of datasets per domain. The health and medicine domain has the highest dataset count, reflecting a significant interest in clinical AI applications.
  • Figure 5: Average fidelity by domain and XAI method. Higher bars indicate stronger alignment between the explanation and the model's predictions.
  • ...and 1 more figure