From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI

Stefanie Krause; Frieder Stolzenburg

TL;DR

This work investigates whether large language models (LLMs) can perform commonsense reasoning at near-human levels and provide human-understandable explanations in QA tasks. It evaluates three models (GPT-3.5, Gemma, and Llama 3) on 11 diverse QA benchmarks, sampling 30 examples per dataset, and supplements the findings with a questionnaire that assesses explanation quality. Llama 3 achieves a mean accuracy of 90% across datasets and surpasses human performance by 21% on average, while GPT-3.5 and Gemma reach around 72%. Explanations from GPT-3.5 are generally rated good to excellent, with a positive link between task comprehensibility and explanation quality. The study demonstrates the potential of LLMs for zero-shot commonsense reasoning and explainable AI, while highlighting limitations (e.g., context gaps, medical knowledge, semantic relations) and the need for careful evaluation of explanations and user concerns in real-world deployments.

Abstract

Commonsense reasoning is a difficult task for a computer, but a critical skill for an artificial intelligence (AI). It can enhance the explainability of AI models by enabling them to provide intuitive and human-like explanations for their decisions. This is necessary in many areas, especially in question answering (QA), which is one of the most important tasks of natural language processing (NLP). Over time, a multitude of methods have emerged for solving commonsense reasoning problems, such as knowledge-based approaches using formal logic or linguistic analysis. In this paper, we investigate the effectiveness of large language models (LLMs) on different QA tasks with a focus on their abilities in reasoning and explainability. We study three LLMs: GPT-3.5, Gemma and Llama 3. We further evaluate the LLM results by means of a questionnaire. We demonstrate the ability of LLMs to reason with commonsense, as the models outperform humans on different datasets. While GPT-3.5's accuracy ranges from 56% to 93% on various QA benchmarks, Llama 3 achieved a mean accuracy of 90% on all eleven datasets. Llama 3 thereby outperforms humans, with an average 21% higher accuracy over the ten datasets for which human results were collected. Furthermore, we find that, in the sense of explainable artificial intelligence (XAI), GPT-3.5 provides good explanations for its decisions. Our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent". Taken together, these findings enrich our understanding of current LLMs and pave the way for future investigations of reasoning and explainability.
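The evaluation protocol summarized above (a fixed-size sample per benchmark, per-dataset accuracy, then a mean over datasets) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the `question`/`answer` record layout, the exact-match scoring, and the seeded sampling are all assumptions made for the example.

```python
import random
from statistics import mean


def sample_examples(dataset, k=30, seed=0):
    """Draw up to k examples without replacement, reproducibly (assumed protocol)."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(k, len(dataset)))


def accuracy(examples, predict):
    """Fraction of examples whose predicted answer exactly matches the gold answer."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(1 for ex in examples if predict(ex["question"]) == ex["answer"])
    return correct / len(examples)


def mean_accuracy(datasets, predict, k=30, seed=0):
    """Return per-dataset accuracies and their unweighted mean.

    `datasets` maps a benchmark name to a list of {"question", "answer"} dicts;
    `predict` is any callable standing in for a model (hypothetical interface).
    """
    per_dataset = {
        name: accuracy(sample_examples(data, k, seed), predict)
        for name, data in datasets.items()
    }
    return per_dataset, mean(list(per_dataset.values()))
```

Here `predict` would wrap a call to whichever model is being evaluated. The mean is unweighted, so every benchmark counts equally regardless of its original size, which matches the idea of drawing the same number of examples from each dataset.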

Paper Structure (19 sections, 6 figures, 3 tables)

This paper contains 19 sections, 6 figures, 3 tables.

Introduction
Related Work
Commonsense Reasoning Approaches
LLMs
Combining LLMs and Reasoning
Evaluating LLMs on QA Tasks
Benchmark Datasets
LLM Results on Different Datasets
Analysis of LLM Results
Questionnaire
Design of the Questionnaire
Questionnaire Participants
Questionnaire Responses
Discussion and Future Directions
Main Findings
...and 4 more sections

Figures (6)

Figure 1: Example of an invalid response from GPT-3.5 due to insufficient context information (COPA example 612).
Figure 2: Example of one QA task with the common structure (ARC example). The next two questions, which show the possible response and its explanation, appear only after participants have answered the first two questions.
Figure 3: Comparison of the accuracy of human questionnaire participants (orange) and three different LLMs, GPT-3.5, Gemma and Llama 3 (blue), on ten different datasets.
Figure 4: Participants' rating of GPT-3.5's explanation quality on a 5-point Likert scale from "very poor" to "excellent".
Figure 5: Participants' estimates of how many of the twenty explanations were generated by an AI. In fact, all 20 explanations were generated by GPT-3.5.
...and 1 more figures
