UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations
Wenting Zhao, Justin T Chiu, Jena D. Hwang, Faeze Brahman, Jack Hessel, Sanjiban Choudhury, Yejin Choi, Xiang Lorraine Li, Alane Suhr
TL;DR
UNcommonsense introduces a large English-language benchmark for abductive reasoning about uncommon events, pairing contexts with unlikely outcomes and crowd-assembled explanations. It analyzes how humans and LLMs generate explanations, and proposes two online imitation-learning methods (EaO and SED) to improve open models beyond supervised fine-tuning, achieving notable gains (roughly 10 percentage points) against GPT-4 baselines. The dataset combines un-SocialIQA and un-RocStories sources, totaling 20,947 context–outcome pairs and over 40k crowd-based explanations plus millions of LLM-generated variants, enabling rich analyses of explanation quality, diversity, and length. The work demonstrates that leveraging expert behavior online, particularly via Expert as Oracle (EaO), can substantially reduce lose rates and improve abductive reasoning in models with restricted access, with implications for fairness and reliability in handling rare, high-stakes situations.
Abstract
Language technologies that accurately model the dynamics of events must perform commonsense reasoning. Existing work evaluating commonsense reasoning focuses on making inferences about common, everyday situations. To instead investigate the ability to model unusual, unexpected, and unlikely situations, we explore the task of uncommonsense abductive reasoning. Given a piece of context with an unexpected outcome, this task requires reasoning abductively to generate an explanation that makes the unexpected outcome more likely in the context. To this end, we curate and release a new English language corpus called UNcommonsense. We characterize the performance differences between human explainers and the best-performing large language models, finding that model-enhanced human-written explanations achieve the highest quality by trading off between specificity and diversity. Finally, we experiment with several imitation learning algorithms to train open and accessible language models on this task. When compared with the vanilla supervised fine-tuning approach, these methods consistently reduce lose rates on both common and uncommonsense abductive reasoning judged by human evaluators.
