FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models

Yiyuan Li, Shichao Sun, Pengfei Liu

TL;DR

FRoG introduces a fuzzy reasoning benchmark that replaces precise numerical data with generalized quantifiers in real-world math problems, enabling evaluation of LLMs on GQ-based tasks. The study reveals persistent challenges in fuzzy reasoning, with an inverse scaling trend across many model families and limited, inconsistent gains from math- or code-specialized tuning and general alignment. It also shows that strong mathematical reasoning does not reliably predict success on FRoG, and that scaling laws are not universal, highlighting the need for explicit modeling of quantifier semantics. Overall, FRoG demonstrates diverse reasoning strategies across models and underscores the importance of studying fuzzy, language-driven uncertainty in LLMs for real-world decision making.

Abstract

Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.
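
To make the task format concrete, the following is a minimal sketch of how a percentage mention in a math word problem could be hidden and turned into a quantifier-selection question. It is an illustration under stated assumptions, not the authors' construction pipeline: the function names, the `[MASK]` token, the regular expression, and the quantifier options are all hypothetical, and the rule that decides which quantifier counts as correct is deliberately left out because it is not described here.

```python
import re

# Hypothetical option set, for illustration only; not taken from the benchmark.
QUANTIFIER_OPTIONS = ["few", "some", "most", "all"]

# Matches percentage mentions such as "40%", "12.5%", or "40 %".
PERCENT = re.compile(r"\d+(?:\.\d+)?\s?%")


def mask_first_percentage(problem, mask="[MASK]"):
    """Replace the first percentage mention with a mask token.

    Returns (masked_problem, hidden_value), or (problem, None) when the
    text contains no percentage mention.
    """
    match = PERCENT.search(problem)
    if match is None:
        return problem, None
    hidden_value = float(match.group().rstrip("%").strip())
    masked = problem[:match.start()] + mask + problem[match.end():]
    return masked, hidden_value


def build_prompt(masked_problem, options):
    """Format the masked problem as a multiple-choice quantifier question."""
    lines = [masked_problem, "Which quantifier best fills [MASK]?"]
    lines += [f"({chr(ord('A') + i)}) {q}" for i, q in enumerate(options)]
    return "\n".join(lines)


if __name__ == "__main__":
    problem = "A store sold 40% of its 200 apples. The store sold 80 apples."
    masked, hidden = mask_first_percentage(problem)
    print(build_prompt(masked, QUANTIFIER_OPTIONS))
```

The sketch keeps the problem's answer in the text and hides only the percentage, so a model would have to reason backwards from the remaining numbers to a quantifier rather than compute a value directly.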

Paper Structure (19 sections, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Workflow of FRoG construction.
  • Figure 2: (Top) quantifier proportions in FRoG. (Bottom) percentiles of target percentage mentions categorized by quantifiers. Green and orange lines represent the means and medians, respectively. The x-axis is shared between the two figures.
  • Figure 3: Average Mask accuracy of several LLMs on FRoG-Easy and FRoG-Hard, sorted in ascending order. Dots with the same color belong to the same model family. Models with additional pretraining or instruction tuning do not necessarily perform better. See the figures on the impact of mathematical training and of alignment for more details.
  • Figure 4: Impact of continued pretraining on mathematical data on LLM performance on FRoG. The solid and dashed lines represent FRoG-Hard and FRoG-Easy, respectively. The result of CodeLlama (70B) is omitted from the illustration due to its poor performance.
  • Figure 5: Mask accuracy of Qwen-1.5-Chat models. The solid and dashed lines represent the hard and easy splits, respectively.
  • ...and 3 more figures