Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

Linxin Song, Xuwei Ding, Jieyu Zhang, Taiwei Shi, Ryotaro Shimizu, Rahul Gupta, Yang Liu, Jian Kang, Jieyu Zhao

TL;DR

The paper addresses the challenge of identifying factual knowledge deficiencies in large language models when confronted with massive knowledge bases, especially for closed-weight models. It introduces stochastic error ascent (SEA), a scalable, budget-aware framework that performs error-driven probing using semantic similarity, hierarchical retrieval, and a relation DAG to reveal error propagation and systematic failures. SEA significantly outperforms baselines (ACD and AutoBencher) in both the number of errors discovered and cost efficiency, validated across multiple LLM families and through human QA evaluation. The work provides insights into model weaknesses, data coverage gaps, and directions for targeted data collection and fine-tuning, with potential extensions to multimodal domains and expanded search scopes.

Abstract

Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge, leading to hallucinations and unreliable outputs. Understanding LLMs' knowledge deficiencies by exhaustively evaluating against full-scale knowledge bases is computationally prohibitive, especially for closed-weight models. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs under a strict query budget. Rather than naively probing all knowledge candidates, SEA formulates error discovery as a stochastic optimization process: it iteratively retrieves new high-error candidates by leveraging the semantic similarity to previously observed failures. To further enhance search efficiency and coverage, SEA employs hierarchical retrieval across document and paragraph levels, and constructs a relation directed acyclic graph to model error propagation and identify systematic failure modes. Empirically, SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher, while reducing the cost-per-error by 599x and 9x, respectively. Human evaluation confirms the high quality of generated questions, while ablation and convergence analyses validate the contribution of each component in SEA. Further analysis on the discovered errors reveals correlated failure patterns across LLM families and recurring deficits, highlighting the need for better data coverage and targeted fine-tuning in future LLM development.
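The error-driven retrieval loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for a real sentence encoder, `is_error` is a hypothetical oracle that replaces querying and grading a closed-weight model, and the names `error_ascent` and `batch_size` are introduced here for illustration only. Hierarchical retrieval and the relation DAG are omitted.

```python
import math
import random
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def error_ascent(knowledge_base, is_error, budget, batch_size=2, seed=0):
    """Spend at most `budget` queries, preferring paragraphs that are
    similar to paragraphs the model has already failed on.

    `is_error(paragraph)` is a hypothetical oracle returning True when
    the model under test answers a question about the paragraph wrongly.
    """
    rng = random.Random(seed)
    vectors = [embed(p) for p in knowledge_base]
    unseen = set(range(len(knowledge_base)))
    queried, errors = [], []

    while unseen and len(queried) < budget:
        k = min(batch_size, budget - len(queried), len(unseen))
        if errors:
            # Rank unseen paragraphs by similarity to the closest known failure.
            def score(i):
                return max(cosine(vectors[i], vectors[e]) for e in errors)

            batch = sorted(unseen, key=lambda i: (-score(i), i))[:k]
        else:
            # No failure observed yet: fall back to random exploration.
            batch = rng.sample(sorted(unseen), k)

        for i in batch:
            unseen.remove(i)
            queried.append(i)
            if is_error(knowledge_base[i]):
                errors.append(i)

    return queried, errors
```

Calling `error_ascent(paragraphs, oracle, budget=100)` returns the indices queried and the subset on which the oracle reported an error. The random fallback reflects one possible way to seed the search; the paper's own initialization may differ.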

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Overall workflow of stochastic error ascent (SEA). We search for a closed-weight model's unknown knowledge iteratively from a given knowledge base until we reach the budget. The result from SEA can be further used to analyze the model's unknown categories and error patterns.
  • Figure 2: Comparison of errors discovered by ACD, AutoBencher, and SEA. We compare SEA with ACD under the same budget and with AutoBencher under the same number of questions. For ACD, we count the number of failed tasks; for SEA, we count the number of source errors. We let AutoBencher create 13 benchmarks, each of which takes one of the Wikipedia categories as its topic of interest, and let SEA search the same number of questions for each model. o1-mini failed on ACD because the prompts violated OpenAI's usage policy.
  • Figure 3: Per-step error $T_{E}(f_{\text{close}})$ and cumulative error $T_{{\mathcal{S}}}(f_{\text{close}})$ for each model. We observe that the errors of all models are positively related to step, indicating SEA can gradually and continually find the model's knowledge deficiencies from the knowledge base.
  • Figure 4: Ablation studies on the component contribution of SEA. We compare SEA with two variants: without source pruning (i.e., skipping lines 10 and 11 in Alg. \ref{alg:1}) and random selection (i.e., skipping lines 9, 10, and 11 in Alg. \ref{alg:1}). We observe that each component contributes equally to SEA.
  • Figure 5: Comparison of cross-validation between each model. The X-axis indicates the subset provider (i.e., the $\hat{\mathcal{S}}$ provider; sourced from the experiments in Fig. \ref{fig:per-step-acc}), and the Y-axis denotes the testee. We summarize two results: (1) the correlation between the testee's result and the provider's result, and (2) the accuracy of the testee on each provider's results. The higher the correlation, the more similar the answers of the two models are. Similarly, the higher the testee's accuracy, the more challenging the provider's question.
  • ...and 2 more figures
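
The two cross-validation quantities named in the Figure 5 caption (testee accuracy on a provider's question subset, and the correlation between testee and provider results) can be sketched as follows. This is an illustration under stated assumptions: results are encoded as per-question 0/1 correctness lists, and Pearson correlation is assumed because the caption does not say which correlation measure is used. The function names are hypothetical.

```python
import math


def accuracy(correct):
    """Fraction of questions answered correctly (entries are 0 or 1)."""
    return sum(correct) / len(correct)


def pearson(x, y):
    """Pearson correlation between two equal-length result vectors.

    Assumed measure; returns NaN when either vector is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


# Hypothetical per-question correctness on one provider's subset.
provider_result = [0, 0, 1, 0, 1, 0]
testee_result = [0, 1, 1, 0, 1, 0]

testee_accuracy = accuracy(testee_result)
agreement = pearson(provider_result, testee_result)
```

Repeating this for every (provider, testee) pair would fill one cell of each matrix in a cross-validation comparison of this kind.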