The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs
Seiji Maekawa, Hayate Iso, Nikita Bhutani
TL;DR
Distinctive Feature Mining (DFM) challenges LLMs to identify globally rare features across document collections of size $n \in \{10, 20, 40\}$, addressing a gap in benchmarks that emphasize retrieval rather than population-level rarity. The authors operationalize DFM with DiFBench, a configurable framework that constructs datasets from a rarity threshold $\theta$ and a total number of features distributed across documents, enabling controlled evaluation of corpus-level statistical reasoning. Ten models, spanning reasoning-enhanced and general-purpose variants, are evaluated. Reasoning models outperform general-purpose ones, but all models degrade as collection size and rarity strictness increase, and base-rate neglect shows up as frequent features misidentified as distinctive. A simple explicit-verification prompting strategy improves precision by about 65% in relative terms, offering a practical mitigation, though limitations in multi-document reasoning and frequency estimation persist. Overall, the work exposes fundamental gaps in current LLMs' corpus-level statistical reasoning and points toward more reliable decision support through targeted prompting and evaluation frameworks.
Abstract
Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained, statistical reasoning and rarity detection.
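To make the task definition concrete, the following is a minimal sketch of how ground-truth distinctive features and a precision score could be computed once each document's features are already available as strings. It is not DiFBench's implementation: the function names `distinctive_features` and `precision`, the dictionary layout, and the strict inequality (document frequency below $\theta$) are all assumptions made for illustration. In the actual task the model must infer these frequencies from raw document text.

```python
from collections import Counter


def distinctive_features(docs, theta=0.1):
    """Return, per document, the features whose document frequency is below theta.

    `docs` maps a document id to a set of feature strings (hypothetical layout).
    The strict '<' follows the "less than 10%" wording and is an assumption.
    """
    n = len(docs)
    doc_freq = Counter()
    for feats in docs.values():
        doc_freq.update(set(feats))  # count each feature once per document
    rare = {f for f, count in doc_freq.items() if count / n < theta}
    return {doc_id: set(feats) & rare for doc_id, feats in docs.items()}


def precision(predicted, gold):
    """Fraction of predicted (document, feature) pairs that are truly distinctive."""
    total = sum(len(feats) for feats in predicted.values())
    if total == 0:
        return 0.0
    correct = sum(len(feats & gold.get(doc_id, set()))
                  for doc_id, feats in predicted.items())
    return correct / total


# Toy collection of 20 documents: "python" is universal, "sql" appears in 3
# documents (15%), and "cobol" appears in exactly 1 document (5%).
docs = {f"d{i}": {"python"} for i in range(20)}
docs["d0"].add("cobol")
for doc_id in ("d1", "d2", "d3"):
    docs[doc_id].add("sql")

gold = distinctive_features(docs, theta=0.1)

# A prediction exhibiting base-rate neglect: two frequent features are
# reported as distinctive alongside the one genuinely rare feature.
predicted = {"d0": {"cobol", "python"}, "d1": {"sql"}}
score = precision(predicted, gold)
```

Here only "cobol" falls below the threshold, so the two frequent features in `predicted` count as false positives and lower the precision. Note that with $n = 10$ and a strict 10% threshold no feature present in the collection could qualify, so the choice between strict and non-strict comparison matters at small collection sizes.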
