The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

Seiji Maekawa, Hayate Iso, Nikita Bhutani

TL;DR

Distinctive Feature Mining (DFM) challenges LLMs to identify globally rare features across document collections of size $n \in \{10,20,40\}$, addressing a gap in benchmarks that emphasize retrieval rather than population-level rarity. The authors operationalize DFM with DiFBench, a configurable framework that builds datasets by distributing a pool of features across documents under a rarity threshold $\theta$, enabling controlled evaluation of corpus-level statistical reasoning. Ten models, spanning reasoning-enhanced and general-purpose variants, are evaluated. Reasoning models outperform general-purpose ones, but performance degrades as collection size and rarity strictness increase, and base-rate neglect shows up as frequent features misidentified as distinctive. A simple explicit-verification prompting strategy yields a relative precision improvement of about 65%, offering a practical mitigation while leaving persistent limitations in multi-document reasoning and frequency estimation. Overall, the work exposes fundamental gaps in current LLMs' corpus-level statistical reasoning and points toward more reliable decision support through targeted prompting and evaluation frameworks.

Abstract

Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained, statistical reasoning and rarity detection.
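As a concrete reading of the task definition, the following minimal sketch computes the document frequency of each feature in a toy collection, flags the features at or below a rarity threshold, and scores a prediction with precision, recall, and F1. The function names, the representation of each document as a set of feature strings, the inclusive threshold, and exact string matching between predicted and gold features are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def distinctive_features(docs, theta_percent):
    """Return features that appear in at most theta_percent % of the documents.

    docs is a list of sets of feature strings (an assumed representation).
    Integer arithmetic avoids floating-point issues at the boundary.
    """
    n = len(docs)
    doc_freq = Counter(f for doc in docs for f in doc)
    return {f for f, c in doc_freq.items() if c * 100 <= theta_percent * n}

def precision_recall_f1(predicted, gold):
    """Score predicted features against gold ones by exact string match (assumed)."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy collection: every document mentions "NLP experience"; only one
# mentions "Agentic AI development".
docs = [{"NLP experience", "Python"} for _ in range(9)]
docs.append({"NLP experience", "Agentic AI development"})

gold = distinctive_features(docs, theta_percent=10)

# A prediction that names the frequent feature instead of the rare one
# scores zero on all three metrics.
scores = precision_recall_f1({"NLP experience"}, gold)
```

In this toy setup, a model that reports the universally shared feature and misses the rare one gets no credit, which is the failure mode the abstract describes.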

Paper Structure

This paper contains 37 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Example of Distinctive Feature Mining (DFM). Given a set of documents, the model needs to identify globally rare features. Here, the model incorrectly identifies "NLP experience" as distinctive, when it is shared by all documents. In contrast, it misses the truly rare feature "Agentic AI development".
  • Figure 2: Overview of DiFBench. To obtain the distinctive features $\mathcal{F}^\delta$, we first randomly select $k$ features from the feature set $\mathcal{F}$; the remaining features are treated as common features $\mathcal{F}^{\neg\delta}$. Distinctive features $\mathcal{F}^\delta$ are distributed across documents so that each appears in at most $\theta$% of the $n$ documents. Common features $\mathcal{F}^{\neg\delta}$ are then distributed so that each appears in more than $\theta$% of the $n$ documents.
  • Figure 3: F1 scores with various document sizes. The error bars indicate the standard deviation across samples.
  • Figure 4: F1 scores with $40$ documents and various distinctive thresholds.
  • Figure 5: The precision and recall
  • ...and 6 more figures
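The construction summarized in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the DiFBench code: the function `build_collection`, the integer percentage `theta_percent`, the uniform sampling of how many documents each feature lands in, and the requirement that every distinctive feature appears at least once are all hypothetical choices. The sketch only assigns feature labels to documents and does not generate document text.

```python
import random
from collections import Counter

def build_collection(features, k, n, theta_percent, seed=0):
    """Assign features to n documents following the Figure 2 description.

    k randomly chosen features become distinctive (each in at most
    theta_percent % of documents); the rest become common (each in more
    than theta_percent % of documents).
    """
    if not 0 < theta_percent < 100:
        raise ValueError("theta_percent must be strictly between 0 and 100")
    rng = random.Random(seed)
    # Largest document count c that still satisfies c / n <= theta_percent / 100.
    max_rare = (theta_percent * n) // 100
    if max_rare < 1:
        raise ValueError("threshold too strict for this collection size")

    distinctive = rng.sample(features, k)
    distinctive_set = set(distinctive)
    common = [f for f in features if f not in distinctive_set]

    docs = [set() for _ in range(n)]
    for f in distinctive:
        count = rng.randint(1, max_rare)        # rare: 1..max_rare documents
        for i in rng.sample(range(n), count):
            docs[i].add(f)
    for f in common:
        count = rng.randint(max_rare + 1, n)    # common: strictly above threshold
        for i in rng.sample(range(n), count):
            docs[i].add(f)
    return docs, distinctive_set

features = [f"feature_{i}" for i in range(20)]
docs, distinctive = build_collection(features, k=5, n=40, theta_percent=10)
doc_freq = Counter(f for doc in docs for f in doc)
```

With $n = 40$ and $\theta = 10$, each distinctive feature in this sketch lands in between 1 and 4 documents, and each common feature in at least 5.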