Table of Contents
Fetching ...

Psychometric-Based Evaluation for Theorem Proving with Large Language Models

Jianyu Zhang, Yongwang Zhao, Long Zhang, Jilin Hu, Xiaokun Luan, Zhiwei Xu, Feng Yang

TL;DR

The paper addresses the limitations of $Pass@N$-style evaluation for LLM theorem proving, notably its equal weighting of theorems and high computational cost. It introduces a psychometric-based framework with Dataset Annotation (miniF2F-Graded) and Adaptive Evaluation to dynamically test high-information theorems and iteratively update an LLM's ability score. The approach yields finer discrimination among 10 open-source LLMs and reduces evaluation cost by $76.13\%$, while maintaining or improving insights into model gaps. This work offers a scalable, nuanced benchmark for LLM theorem proving and a general blueprint for applying adaptive evaluation to formal datasets.

Abstract

Large language models (LLMs) for formal theorem proving have become a prominent research focus. At present, the proving ability of these LLMs is mainly evaluated through proof pass rates on datasets such as miniF2F. However, this evaluation method overlooks the varying importance of theorems. As a result, it fails to highlight the real performance disparities between LLMs and leads to high evaluation costs. This study proposes a psychometric-based evaluation method for theorem proving with LLMs, comprising two main components: Dataset Annotation and Adaptive Evaluation. First, we propose a metric calculation method to annotate the dataset with difficulty and discrimination metrics. Specifically, we annotate each theorem in the miniF2F dataset and grade them into varying difficulty levels according to the performance of LLMs, resulting in an enhanced dataset: miniF2F-Graded. Experimental results show that the difficulty grading in miniF2F-Graded better reflects the theorem difficulty perceived by LLMs. Secondly, we design an adaptive evaluation method to dynamically select the most suitable theorems for testing based on the annotated metrics and the real-time performance of LLMs. We apply this method to evaluate 10 LLMs. The results show that our method finely highlights the performance disparities between LLMs. It also reduces evaluation costs by using only 23% of the theorems in the dataset.

Psychometric-Based Evaluation for Theorem Proving with Large Language Models

TL;DR

The paper addresses the limitations of -style evaluation for LLM theorem proving, notably its equal weighting of theorems and high computational cost. It introduces a psychometric-based framework with Dataset Annotation (miniF2F-Graded) and Adaptive Evaluation to dynamically test high-information theorems and iteratively update an LLM's ability score. The approach yields finer discrimination among 10 open-source LLMs and reduces evaluation cost by , while maintaining or improving insights into model gaps. This work offers a scalable, nuanced benchmark for LLM theorem proving and a general blueprint for applying adaptive evaluation to formal datasets.

Abstract

Large language models (LLMs) for formal theorem proving have become a prominent research focus. At present, the proving ability of these LLMs is mainly evaluated through proof pass rates on datasets such as miniF2F. However, this evaluation method overlooks the varying importance of theorems. As a result, it fails to highlight the real performance disparities between LLMs and leads to high evaluation costs. This study proposes a psychometric-based evaluation method for theorem proving with LLMs, comprising two main components: Dataset Annotation and Adaptive Evaluation. First, we propose a metric calculation method to annotate the dataset with difficulty and discrimination metrics. Specifically, we annotate each theorem in the miniF2F dataset and grade them into varying difficulty levels according to the performance of LLMs, resulting in an enhanced dataset: miniF2F-Graded. Experimental results show that the difficulty grading in miniF2F-Graded better reflects the theorem difficulty perceived by LLMs. Secondly, we design an adaptive evaluation method to dynamically select the most suitable theorems for testing based on the annotated metrics and the real-time performance of LLMs. We apply this method to evaluate 10 LLMs. The results show that our method finely highlights the performance disparities between LLMs. It also reduces evaluation costs by using only 23% of the theorems in the dataset.

Paper Structure

This paper contains 34 sections, 5 equations, 5 figures, 7 tables, 2 algorithms.

Figures (5)

  • Figure 1: The method consists of two parts. In Part 1: Dataset Annotation, difficulty and discrimination metrics are assigned to each theorem in the miniF2F dataset based on the performance of LLMs, as determined by the Metric Calculation process, resulting in an annotated dataset, miniF2F-Graded. In Part 2: Adaptive Evaluation, the most suitable theorems are dynamically selected for testing based on their annotated metrics in miniF2F-Graded and the initial or current ability scores of LLMs, following the Theorem Selection Algorithm. Throughout the iterative testing process, ability scores are continuously updated under the Adaptive Test Algorithm until they converge, with the final stabilized scores serving as the evaluation results.
  • Figure 2: The scatter plot of the dataset annotation results illustrates the relationship between theorem difficulty (x-axis) and discrimination (y-axis).
  • Figure 3: Confusion matrix of model rankings.
  • Figure 4: The scatter plot of the dataset annotation results when the correction term is not used in the difficulty metric calculation.
  • Figure 5: Difficulty Distribution Chart for Theorems in Various Categories..