
Evolutionary Search for Automated Design of Uncertainty Quantification Methods

Mikhail Seleznyov, Daniil Korbut, Viktor Moskvoretskii, Oleg Somov, Alexander Panchenko, Elena Tutubalina

Abstract

Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs pursue distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance -- Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results indicate that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.

Paper Structure

This paper contains 26 sections, 1 equation, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: A visualization of the LLM-powered evolutionary search pipeline, used to autonomously design uncertainty quantification (UQ) methods for hallucination detection. At the start, the candidate pool is initialized with a single selected baseline method.
  • Figure 2: Average ROC-AUC across 9 hallucination detection datasets (atomic factual claims). For Claude-generated candidates, the top 30 methods were selected by validation performance on the PopQA dataset; we report 5 of them, with test-performance ranks 1, 8, 15, 22, and 30. For Gpt-oss-generated methods, we show every second method among the top 16 by validation performance on PopQA.
  • Figure 3: Left: average and median token lengths of atomic claims. Right: visualization of exponential and linear weighting for several typical claim lengths. Exponential weighting puts more emphasis on the last tokens than linear weighting does, especially for shorter sequences.
  • Figure 4: Correlation between method complexity and performance for different LLMs. Each dot corresponds to one evolution run.
  • Figure 5: Evolution dynamics in (complexity, performance) coordinates for 6 different models. We use line count as the simplest proxy for method complexity; other proxies, such as the number of AST nodes, the number of binary and unary operators, or Halstead volume, follow the same pattern.
  • ...and 2 more figures
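
The positional weighting schemes contrasted in Figure 3 can be illustrated with a minimal sketch. The exact weighting formulas used by the evolved methods are not given in this section, so the functions below (`linear_weights`, `exponential_weights`, `weighted_uncertainty`, and the `rate` parameter) are illustrative assumptions: they show only the qualitative behavior described in the caption, namely that exponential weighting concentrates more mass on the final tokens of a claim than linear weighting does.

```python
import math


def linear_weights(n: int) -> list[float]:
    """Linearly increasing positional weights, normalized to sum to 1."""
    raw = [i + 1 for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def exponential_weights(n: int, rate: float = 0.5) -> list[float]:
    """Exponentially increasing positional weights (illustrative `rate`),
    normalized to sum to 1; later tokens dominate more sharply."""
    raw = [math.exp(rate * i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_uncertainty(token_scores: list[float], weights: list[float]) -> float:
    """Aggregate per-token uncertainty scores into one claim-level score."""
    return sum(s * w for s, w in zip(token_scores, weights))


# Toy per-token uncertainties for a short 4-token claim: the last token
# is the most uncertain, so exponential weighting yields a higher score.
scores = [0.1, 0.2, 0.1, 0.8]
lin_score = weighted_uncertainty(scores, linear_weights(len(scores)))
exp_score = weighted_uncertainty(scores, exponential_weights(len(scores)))
```

For the 4-token example, the exponential scheme assigns roughly 46% of the total weight to the last token versus 40% under the linear scheme, matching the caption's observation that the gap is most pronounced for shorter sequences.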
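
Figure 5 measures method complexity by line count, with AST node count mentioned as a finer-grained alternative. A minimal sketch of two such proxies, using only the standard-library `ast` module (the helper names and the toy `candidate` program are assumptions for illustration):

```python
import ast


def line_count(src: str) -> int:
    """Count non-empty source lines: the simplest complexity proxy."""
    return sum(1 for line in src.splitlines() if line.strip())


def ast_node_count(src: str) -> int:
    """Count all nodes in the parsed AST: a finer-grained proxy that is
    insensitive to formatting choices such as line wrapping."""
    return sum(1 for _ in ast.walk(ast.parse(src)))


# A toy two-line candidate UQ method (mean negative log-probability).
candidate = "def score(logprobs):\n    return -sum(logprobs) / len(logprobs)\n"
```

Because `ast.walk` traverses every node while line count collapses an entire statement to one unit, the AST proxy grows faster with expression nesting; as the caption notes, both proxies track the same overall complexity trend.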