UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations

Wenting Zhao, Justin T Chiu, Jena D. Hwang, Faeze Brahman, Jack Hessel, Sanjiban Choudhury, Yejin Choi, Xiang Lorraine Li, Alane Suhr

TL;DR

UNcommonsense introduces a large English-language benchmark for abductive reasoning about uncommon events, pairing contexts with unlikely outcomes and crowd-sourced explanations. It analyzes how humans and LLMs generate explanations, and proposes two online imitation-learning methods (EaO and SED) that improve open models beyond supervised fine-tuning, with gains of roughly 10 percentage points against GPT-4 baselines. The dataset combines the un-SocialIQA and un-RocStories subsets, totaling 20,947 context–outcome pairs and over 40k crowd-based explanations, plus millions of LLM-generated variants, which enables detailed analyses of explanation quality, diversity, and length. The work demonstrates that leveraging expert behavior online, particularly via Expert as Oracle, can substantially reduce lose rates and improve abductive reasoning in models with restricted access, with implications for fairness and reliability in handling rare, high-stakes situations.

Abstract

Language technologies that accurately model the dynamics of events must perform commonsense reasoning. Existing work evaluating commonsense reasoning focuses on making inferences about common, everyday situations. To instead investigate the ability to model unusual, unexpected, and unlikely situations, we explore the task of uncommonsense abductive reasoning. Given a piece of context with an unexpected outcome, this task requires reasoning abductively to generate an explanation that makes the unexpected outcome more likely in the context. To this end, we curate and release a new English language corpus called UNcommonsense. We characterize the performance differences between human explainers and the best-performing large language models, finding that model-enhanced human-written explanations achieve the highest quality by trading off between specificity and diversity. Finally, we experiment with several imitation learning algorithms to train open and accessible language models on this task. When compared with the vanilla supervised fine-tuning approach, these methods consistently reduce lose rates on both common and uncommonsense abductive reasoning judged by human evaluators.

Paper Structure (40 sections, 1 equation, 12 figures, 18 tables, 2 algorithms)

Figures (12)

  • Figure 1: Given a context and an uncommon outcome, uncommonsense abductive reasoning aims to produce an explanation so that the unlikely outcome becomes likely. The explanation needs to follow the three rules noted with the check marks.
  • Figure 2: Qualitative comparison between LLM explanations, Crowd explanations, and C+LLM explanations. In Comments, we make connections to the three rules in explanation writing.
  • Figure 3: Distribution of explanation lengths in un-RocStories (top) and un-SocialIQA (bottom), computed on the development sets of each data subset.
  • Figure 4: Entropies of $n$-gram distributions in un-RocStories (left) and un-SocialIQA (right), computed on the development sets of each data subset.
  • Figure 5: Prompting template for combining a question and its answer.
  • ...and 7 more figures
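
The n-gram entropy reported in Figure 4 can be illustrated with a minimal sketch. The caption does not state the paper's tokenization or logarithm base, so the whitespace tokenization, lowercasing, and base-2 logarithm below are assumptions, and `ngram_entropy` is a hypothetical helper name rather than anything from the authors' code.

```python
import math
from collections import Counter


def ngram_entropy(texts, n):
    """Shannon entropy (in bits) of the empirical n-gram distribution over `texts`.

    Assumptions (not taken from the paper): lowercase, whitespace tokenization,
    n-grams counted within each text only, log base 2.
    """
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the tokens of this text.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    total = sum(counts.values())
    if total == 0:
        return 0.0  # no n-grams of this order exist in the input

    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, a higher value for a fixed `n` means the explanations spread their probability mass over more distinct n-grams, which is one way to read "diversity" when comparing explanation sources on the same development set.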