CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

Armin Toroghi, Willis Guo, Scott Sanner

TL;DR

CoLoTa targets the hallucinations and reasoning errors that arise in entity-based commonsense reasoning over long-tail knowledge. It introduces a 3,300-query benchmark derived from StrategyQA and CREAK by substituting popular entities with obscure Wikidata entities, and it annotates each query with Wikidata anchor entities, a supporting sub-graph, an inference rule, and step-by-step reasoning. Because all required knowledge is grounded in Wikidata, the dataset doubles as a KGQA benchmark, and it reveals that state-of-the-art KGQA methods struggle with long-tail queries that require commonsense reasoning. Experiments show significant drops in accuracy and increased hallucinations for strong LLMs on CoLoTa, while current KGQA baselines perform poorly, underlining the need for approaches that combine factual grounding with robust commonsense reasoning. Overall, CoLoTa offers a platform for evaluating and advancing LLM and KGQA capabilities in long-tail, entity-specific contexts.

Abstract

The rise of Large Language Models (LLMs) has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge and their outstanding performance in tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), which consists of 3,300 queries from question answering and claim verification tasks and covers a diverse range of commonsense reasoning skills. We remark that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset, since the knowledge required to answer its queries is present in the Wikidata knowledge graph. However, unlike existing KGQA benchmarks, which focus only on factoid questions, CoLoTa queries also require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies indicate their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.

Paper Structure

This paper contains 13 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Constituents of an entry from CoLoTa: (i) an entity-based commonsense reasoning query, (ii) Wikidata QIDs of the anchor entities, (iii) the relevant Wikidata sub-graph that contains the factual information needed to answer the query, (iv) an inference rule establishing the commonsense reasoning required to answer the query, and (v) reasoning steps to conclude the final answer.
  • Figure 2: Distribution of reasoning skills in the question answering task. Orange (blue) circles show domain-independent (domain-dependent) reasoning skills.
  • Figure 3: Distribution of reasoning skills in the claim verification task. Orange (blue) circles show domain-independent (domain-dependent) reasoning skills.
  • Figure 4: Distribution of the popularity of the entities targeted in the claim verification task of CoLoTa vs. the original queries, indicating CoLoTa's focus on long-tail entities.
  • Figure 5: Distribution of the popularity of the entities targeted in the question answering task of CoLoTa vs. the original queries, indicating CoLoTa's focus on long-tail entities.
  • ...and 1 more figures
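
The five constituents named in the Figure 1 caption can be pictured as a single record. The following is a minimal sketch, not the dataset's actual schema: the class name `CoLoTaEntry`, every field name, the boolean `answer` field, the helper `missing_anchors`, and all values in the example are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A knowledge-graph fact as (subject, property, object).
# Real Wikidata facts use QIDs and PIDs; the strings below are placeholders.
Triple = Tuple[str, str, str]


@dataclass
class CoLoTaEntry:
    """Hypothetical container mirroring the five parts listed in Figure 1."""
    query: str                  # (i) entity-based commonsense reasoning query
    anchor_qids: List[str]      # (ii) Wikidata QIDs of the anchor entities
    subgraph: List[Triple]      # (iii) facts needed to answer the query
    inference_rule: str         # (iv) commonsense rule, kept as free text here
    reasoning_steps: List[str]  # (v) steps leading to the final answer
    answer: bool                # assumed yes/no or true/false label

    def missing_anchors(self) -> List[str]:
        """Return anchor QIDs that appear in no triple of the sub-graph."""
        seen = set()
        for subj, _prop, obj in self.subgraph:
            seen.add(subj)
            seen.add(obj)
        return [qid for qid in self.anchor_qids if qid not in seen]


# Placeholder entry: none of these identifiers or facts come from the dataset.
entry = CoLoTaEntry(
    query="<query about an obscure entity>",
    anchor_qids=["Q_ANCHOR_A", "Q_ANCHOR_B"],
    subgraph=[
        ("Q_ANCHOR_A", "P_SOME_PROPERTY", "Q_VALUE_1"),
        ("Q_ANCHOR_B", "P_OTHER_PROPERTY", "Q_VALUE_2"),
    ],
    inference_rule="<commonsense rule linking the facts to the answer>",
    reasoning_steps=["<step 1>", "<step 2>"],
    answer=True,
)

print(entry.missing_anchors())
```

A check such as `missing_anchors` is one simple way to confirm that every anchor entity is actually covered by the supporting sub-graph before an entry is used for evaluation.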