MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset

Weiqi Wang; Yangqiu Song

TL;DR

The paper defines Metaphysical Reasoning as a three-step discriminative process for handling distributional changes triggered by actions and environmental factors, and introduces Mars, the first large-scale benchmark, with 355K annotated entries across three tasks. It formalizes events as seven components and uses hierarchical abstractions to generate plausible and metaphysical changes. The curation pipeline comprises text decomposition, component variation, inference generation, and transition generation, and its output is validated by large-scale human annotation. Across more than 20 models, Mars shows that reasoning with distributional changes remains difficult even after fine-tuning, and that prompting strategies yield only limited gains; however, pre-training on conceptual taxonomies (e.g., CANDLE) significantly boosts performance, suggesting a scalable path toward improved metaphysical reasoning in LMs. The data and models are publicly available, and the work outlines a framework for evaluating conscious processing in language models and for guiding future research in out-of-distribution reasoning and System II-like capabilities.

Abstract

To enable Large Language Models (LLMs) to function as conscious agents with generalizable reasoning capabilities, it is crucial that they can comprehend situational changes (transitions) in distribution triggered by environmental factors or actions from other agents. Despite its fundamental significance, this ability remains underexplored due to the complexity of modeling the infinite possible changes in an event and their associated distributions, coupled with the lack of benchmark data with situational transitions. To address these gaps, we propose a novel formulation of reasoning with distributional changes as a three-step discriminative process, termed MetAphysical ReaSoning. We then introduce the first benchmark, MARS, comprising three tasks, one for each step. These tasks systematically assess LLMs' capabilities in reasoning about the plausibility of (i) changes in actions, (ii) states caused by changed actions, and (iii) situational transitions driven by changes in action. Extensive evaluations with 20 (L)LMs of varying sizes and methods indicate that all three tasks in this process pose significant challenges, even for state-of-the-art LLMs and for LMs after fine-tuning. Further analyses reveal potential causes for the underperformance of LLMs and demonstrate that pre-training them on large-scale conceptualization taxonomies can potentially enhance their metaphysical reasoning capabilities. Our data and models are publicly accessible at https://github.com/HKUST-KnowComp/MARS.

Paper Structure (47 sections, 7 figures, 11 tables)

Introduction
Backgrounds and Related Works
Definitions of Changes in Event and Metaphysical Reasoning
Mars Benchmark Curation Pipeline
Text Decomposition and Extraction
Component Abstraction and Variation
Inference Generation
Metaphysical Transition Generation
Human Annotations
Evaluations and Analysis
Mars Statistics
Main Evaluations on Mars
Task Setup and Model Selections
Results and Analysis
Analysis
...and 32 more sections

Figures (7)

Figure 1: Examples of changes in event in our formulation. After changes occur, events may become metaphysical as components are abstracted into high-level concepts, while some remain plausible in reality.
Figure 2: The three steps in metaphysical reasoning. Our motivation behind this is that, by conquering all steps sequentially, a conscious agent could answer: (1) Will the change occur in reality? (2) What will the change cause? (3) What change can make a metaphysical (desired) inference plausible?
Figure 3: Hypernym distribution of the top 5,000 popular component variations.
Figure 4: Performances by component types of fine-tuned LLaMa3-8B on three tasks of Mars.
Figure 5: An overview of our benchmark curation pipeline with running examples.
...and 2 more figures
