Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks

Aditi Mishra, Sajjadur Rahman, Hannah Kim, Kushan Mitra, Estevam Hruschka

TL;DR

This work investigates using large language models to generate knowledge-grounded rationales for knowledge-intensive tasks such as CSQA and OBQA by grounding prompts in external sources like ConceptNet. A retrieval-augmented, few-shot prompting approach conditions LLM outputs on fetched facts and expert exemplars to produce corroborating and refuting rationales. Across multiple human studies, LLM-generated rationales are often preferred to crowdsourced alternatives, but remain imperfect in conciseness and novelty, and faithful rationalization of incorrect predictions can erode user trust. To address this, the authors propose a review-then-rationalize pipeline with a self-consistency-based reviewer that intervenes on potential errors, achieving substantial interruption of incorrect predictions and improving the credibility of explanations for real-world use.

Abstract

Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. Yet, their ability to provide well-grounded rationalizations for knowledge-intensive tasks remains under-explored. Such tasks, like commonsense multiple-choice questions, require rationales based on world knowledge to support predictions and refute alternate options. We consider the task of generating knowledge-guided rationalization in natural language by using expert-written examples in a few-shot manner. Surprisingly, crowd-workers preferred knowledge-grounded rationales over crowdsourced rationalizations, citing their factuality, sufficiency, and comprehensive refutations. Although LLM-generated rationales were preferable, further improvements in conciseness and novelty are required. In another study, we show how rationalization of incorrect model predictions erodes humans' trust in LLM-generated rationales. Motivated by these observations, we create a two-stage pipeline to review task predictions and eliminate potential incorrect decisions before rationalization, enabling trustworthy rationale generation.
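The two-stage idea (review a prediction, then rationalize only if it survives review) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `review_prediction` and `review_then_rationalize`, the callables `sample_answer` and `rationalize`, and the `min_agreement` threshold are all hypothetical, and the reviewer here is a plain majority-vote form of self-consistency chosen for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def review_prediction(
    prediction: str,
    sampled_answers: Sequence[str],
    min_agreement: float = 0.6,  # hypothetical threshold, not from the paper
) -> bool:
    """Return True when the prediction may proceed to rationalization.

    The prediction passes only if it matches the most frequent sampled
    answer and that answer's share of the samples reaches min_agreement.
    """
    if not sampled_answers:
        return False
    votes = Counter(sampled_answers)
    majority_answer, majority_count = votes.most_common(1)[0]
    agreement = majority_count / len(sampled_answers)
    return majority_answer == prediction and agreement >= min_agreement


def review_then_rationalize(
    question: str,
    choices: Sequence[str],
    prediction: str,
    sample_answer: Callable[[str, Sequence[str]], str],    # stand-in for an LLM call
    rationalize: Callable[[str, Sequence[str], str], str],  # stand-in for an LLM call
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> Optional[str]:
    """Stage 1 reviews the prediction; stage 2 rationalizes it if it passes.

    Returns None when the reviewer intervenes, so no rationale is produced
    for a prediction that looks unreliable.
    """
    samples = [sample_answer(question, choices) for _ in range(n_samples)]
    if not review_prediction(prediction, samples, min_agreement):
        return None
    return rationalize(question, choices, prediction)
```

In practice, `sample_answer` would wrap repeated stochastic calls to a model and `rationalize` would wrap the few-shot rationale prompt; both are left abstract here because the page does not specify them.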

Paper Structure (33 sections, 14 figures, 12 tables)

This paper contains 33 sections, 14 figures, 12 tables.

Figures (14)

  • Figure 1: a) A commonsense question with multiple choices and knowledge extracted from ConceptNet and b) proposed LLM-generated rationale corroborating the selected answer and refuting the other choices.
  • Figure 2: Given an Input (i.e., QA and model prediction), an LLM is prompted to generate a rationale with few-shot examples sampled from an expert-written pool.
  • Figure 3: An example in the few-shot prompt: the QA and External Knowledge components are retrieved and the topic and the rationale are expert authored.
  • Figure 4: Distribution of fine-tuned metrics between human-written (ECQA) and LLM-generated rationales. LLM-generated rationales were preferred over ECQA on the majority of metrics, with conciseness as the exception.
  • Figure 5: Crowdworkers' ratings showed similar distributions for all metrics except insightfulness and conciseness. These two metrics were rated lower for the more subjective CSQA dataset than for the objective and scientific OBQA dataset.
  • ...and 9 more figures