Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?

Bo Wang, Yiqiao Li, Jianlong Zhou, Fang Chen

TL;DR

This study addresses how to evaluate machine-learning explanations by comparing Transformer-based LLM judges (GPT-4o and Mistral-7.2B) with human judges in an iris classification task. It proposes a workflow that blends LLM-based and human judgments and employs three explanation methods (LIME, similarity-based, and no explanation) with both subjective and objective metrics. The findings show LLM-based judges align with human judgments on subjective aspects but fall short on objective accuracy, indicating that LLMs can complement but not replace human evaluators in explainability assessment. The work highlights the potential for cost-efficient, human-centric evaluation augmented by carefully designed prompts and calibration, guiding future research toward more robust LLM-assisted explainability evaluation across diverse data and explanations.

Abstract

EXplainable machine learning (XML) has recently emerged to address the opaque mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Although various explanation methods have been developed, it remains unclear which XML method best suits a specific ML context, highlighting the need for effective evaluation of explanations. The evaluation capabilities of Transformer-based large language models (LLMs) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.

Paper Structure

This paper contains 41 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Workflow of Judges for Evaluating Explanations.
  • Figure 2: The Prompt for LLMs Evaluating Explanations. We provide (1) the LLM's role, (2) a task description, and (3) contextual information.
  • Figure 3: An Interface Example of Tasks in the Online User Study.
  • Figure 4: Results for Judges across Explanations Based on Subjective and Objective Metrics. Error bars represent the 95% confidence interval of the mean. Panels (a), (b), (c), (d), and (e) show the subjective metrics: understandability, satisfaction, completeness, usefulness, and trustworthiness, respectively. Panel (f) shows the objective metric, accuracy.
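
The Figure 4 caption describes error bars as 95% confidence intervals of a mean. As a rough illustration of how such an interval could be computed for one judge and one explanation method, here is a minimal sketch. It assumes a normal approximation (the summary does not say which interval the authors used), and the function name `mean_with_ci` and all ratings are hypothetical, invented only for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    Assumption: a z-based interval; a t-based interval would be slightly
    wider for small samples.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return mean(scores), half_width


# Hypothetical ratings, invented for illustration (not data from the paper).
ratings = {
    ("human", "LIME"): [4, 5, 3, 4, 4, 5],
    ("llm", "LIME"): [4, 4, 4, 5, 4, 4],
}

# One (mean, half_width) pair per (judge, explanation method) group.
summary = {key: mean_with_ci(values) for key, values in ratings.items()}
for (judge, method), (avg, half) in summary.items():
    print(f"{judge:>5} / {method}: {avg:.2f} ± {half:.2f}")
```

Each error bar would then span the mean minus and plus the half-width. With samples as small as these, a t-based interval may be the more appropriate choice.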