Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives

Kentaro Ozeki, Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima, Mitsuhiro Okada

TL;DR

The paper studies normative reasoning in large language models by comparing deontic and epistemic modalities on a new benchmark that combines formal patterns (deontic logic and syllogistic reasoning) with non-formal cognitive factors (domain specificity and content effects). It evaluates multiple models under Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting and finds that even strong models show inconsistencies and human-like biases in normative reasoning, particularly around negation and certain paradoxes (e.g., Free Choice and Ross's paradox). Performance varies by task between the normative and epistemic domains, and the choice of prompting strategy matters: CoT in particular can either help or hurt reliability. Overall, the findings highlight the difficulty of achieving logical consistency in LLMs' normative reasoning, and the authors release the dataset and code publicly to support further improvements and comparative analyses.

Abstract

Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs' reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs' normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.

Paper Structure

This paper contains 27 sections, 9 figures, 13 tables.

Figures (9)

  • Figure 1: The two reasoning patterns are logically related (contrapositive), but LLMs often struggle to make consistent predictions. $\mathsf{O} A$ means "$A$ is obligatory" and $\mathsf{P} A$ means "$A$ is permitted." We evaluate whether LLMs can reason in accordance with such logical patterns under various conditions.
  • Figure 2: Accuracy (%) of the best-performing model (GPT-4o) for the Mu-Mi, FC-Or-Intro, and Ross-Or-Intro patterns in the Deontic Logic task. The FC-Or-Intro and Ross-Or-Intro patterns are Normative problems only.
  • Figure 3: Accuracy (%) of the best-performing model (GPT-4o) for the Cat-MT, Cat-DA, Hyp-MT, and Hyp-DA patterns in the Syllogistic task.
  • Figure 4: An example output of GPT-4o with a CoT prompt for reasoning from obligation to permission (Mu-Mi), which includes the expression "can choose to."
  • Figure 5: Examples of errors for Mu-Mi.
  • ...and 4 more figures
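
The logical relation mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the usual possible-worlds reading of standard deontic logic (every world has at least one accessible "ideal" world), which this summary does not spell out; the function and variable names are hypothetical and are not taken from the paper's codebase. The check enumerates only two-world models, so it is a finite sanity check rather than a proof.

```python
from itertools import product

WORLDS = (0, 1)  # assumption: a two-world universe is enough for illustration


def serial_frames():
    """Yield every accessibility relation in which each world sees some world."""
    pairs = [(w, v) for w in WORLDS for v in WORLDS]
    for bits in product([False, True], repeat=len(pairs)):
        rel = {p for p, keep in zip(pairs, bits) if keep}
        if all(any((w, v) in rel for v in WORLDS) for w in WORLDS):
            yield rel


def valuations():
    """Yield every assignment of a truth value for A to each world."""
    for bits in product([False, True], repeat=len(WORLDS)):
        yield dict(zip(WORLDS, bits))


def obligatory(rel, val, w):
    # O A: A is true in every world accessible from w
    return all(val[v] for v in WORLDS if (w, v) in rel)


def permitted(rel, val, w):
    # P A: A is true in at least one world accessible from w
    return any(val[v] for v in WORLDS if (w, v) in rel)


def not_permitted(rel, val, w):
    return not permitted(rel, val, w)


def not_obligatory(rel, val, w):
    return not obligatory(rel, val, w)


def pattern_holds(premise, conclusion):
    """True if the conclusion holds wherever the premise holds, in all enumerated models."""
    return all(
        (not premise(rel, val, w)) or conclusion(rel, val, w)
        for rel in serial_frames()
        for val in valuations()
        for w in WORLDS
    )


# Obligation to permission, and its contrapositive
obligation_to_permission = pattern_holds(obligatory, permitted)
contrapositive = pattern_holds(not_permitted, not_obligatory)
# The converse direction, included as a contrast case
permission_to_obligation = pattern_holds(permitted, obligatory)

print(obligation_to_permission, contrapositive, permission_to_obligation)
```

Under these assumptions, a pattern and its contrapositive must receive the same verdict, which is the kind of consistency the Figure 1 caption says LLMs often fail to maintain.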