"Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline

Grace Li, Milad Alshomary, Smaranda Muresan

TL;DR

This work assesses whether Large Language Models can outperform human experts in generating conversational explanations by comparing three response strategies (the human explainer response, a standard GPT4 response, and a GPT4 response guided by explanatory acts, abbreviated EA) on the WIRED 5-Levels dataset across 11 STEM topics. Annotators rated responses on 8 evaluation dimensions and ranked them per task. GPT-Standard responses generally outranked both the human baseline and the EA-guided responses, though EA prompts offered stronger engagement through targeted follow-ups. Inter-annotator reliability was moderate for rankings, highlighting variability in human judgments, and conciseness emerged as a key factor in effectiveness. The study concludes that LLMs can meaningfully augment expert explainers in real-time dialogue, with future work focusing on personalization and interface design to adapt explanations to individual explainees.

Abstract

Explanations form the foundation of knowledge sharing and build upon communication principles, social dynamics, and learning theories. We focus specifically on conversational approaches to explanation because the context is highly adaptive and interactive. Our research builds on previous work on explanatory acts, a framework for understanding the strategies that explainers and explainees employ in a conversation to explain, understand, and engage with the other party. We use the 5-Levels dataset, which was constructed from the WIRED YouTube series by Wachsmuth et al. and later annotated with explanatory acts by Booshehri et al. (2023). These annotations provide a framework for understanding how explainers and explainees structure their responses. With the rise of generative AI in the past year, we hope to better understand the capabilities of Large Language Models (LLMs) and how they can augment expert explainers' capabilities in conversational settings. To this end, the annotated 5-Levels dataset allows us to audit the ability of LLMs to engage in explanation dialogues. To evaluate the effectiveness of LLMs in generating explainer responses, we asked human annotators to compare 3 different strategies: the human explainer response, a GPT4 standard response, and a GPT4 response with Explanation Moves.
Paper Structure (12 sections, 4 figures, 2 tables)

This paper contains 12 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A sample annotated conversation between an explainer and explainee that has been labeled with a subset of the 20 explanatory acts. The figure illustrates the span-level, fine-grained annotation framework of Booshehri et al.
  • Figure 2: The three different study conditions.
  • Figure 3: A sample image of the annotation interface with some of the rating questions omitted. The interface contains 3 columns of 8 rows, with a 5-star rating in each cell, to evaluate the explainer responses on 8 dimensions: coherence, conciseness, conversational nature, appropriateness, acknowledgement, active guidance, engagement, and depth or expansiveness.
  • Figure 4: List of explanatory moves in our proposed annotation scheme along with their descriptions, arranged in alphabetical order.
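
The rating layout described in the Figure 3 caption (3 response columns, 8 dimensions, 5-star ratings) can be pictured as a small data structure. The following is a minimal sketch, not the authors' analysis code: the condition labels, the function names, and the example ratings are hypothetical, and ordering conditions by mean rating is an illustrative aggregation, since the study may have collected its rankings directly from annotators.

```python
from statistics import mean

# The eight dimensions named in the Figure 3 caption.
DIMENSIONS = [
    "coherence", "conciseness", "conversational nature", "appropriateness",
    "acknowledgement", "active guidance", "engagement", "depth or expansiveness",
]

# Hypothetical labels for the three columns of the interface.
CONDITIONS = ["human", "gpt4_standard", "gpt4_ea"]


def mean_scores(ratings):
    """Average the 1-5 star ratings over all dimensions for each condition."""
    return {c: mean(ratings[c][d] for d in DIMENSIONS) for c in CONDITIONS}


def rank_conditions(ratings):
    """Order conditions from highest to lowest mean score (ties keep list order)."""
    scores = mean_scores(ratings)
    return sorted(CONDITIONS, key=lambda c: -scores[c])


# Made-up ratings from one annotator for one task; not data from the study.
example = {
    "human": dict.fromkeys(DIMENSIONS, 3),
    "gpt4_standard": dict.fromkeys(DIMENSIONS, 4),
    "gpt4_ea": {**dict.fromkeys(DIMENSIONS, 3), "engagement": 5},
}

print(mean_scores(example))
print(rank_conditions(example))
```

Keeping one rating per (condition, dimension) cell mirrors the 3-by-8 grid in the caption and makes it easy to compare a single dimension, such as conciseness, across the three conditions.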