Table of Contents
Fetching ...

ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity

Xiaoyang Liu, Tao Zhu, Zineng Dong, Yuntian Liu, Qingfeng Guo, Zhaoxuan Liu, Yu Chen, Tao Luo

TL;DR

This work presents ASSESS, a semantic-structural evaluation framework for statement similarity in autoformalization. It combines operator-tree representations with a Tree Edit Distance baseline and a novel TransTED Similarity that incorporates semantic transformations to capture equivalence beyond surface form. The EPLA benchmark, built from miniF2F and ProofNet, provides 1,247 expert-annotated pairs labeled for Provability and Likeness, enabling rigorous, graded evaluation. Experimental results show TransTED achieves state-of-the-art accuracy and Kappa on both EPLA-miniF2F and EPLA-ProofNet, outperforming string-based, proof-based, and LLM-based baselines and demonstrating robustness to thresholding. The work includes ablations and analysis of tactic commands, and releases code and benchmark data to support reproducibility and further research.

Abstract

Despite significant strides in statement autoformalization, a critical gap remains in the development of automated evaluation metrics capable of assessing formal translation quality. Existing metrics often fail to balance semantic and structural information: string-based methods neglect semantics, whereas proof-based approaches offer no graded similarity when proofs fail. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which captures syntactic structure by transforming formal statements into operator trees and computes a real-valued similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric by incorporating semantic transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a benchmark comprising 1,247 expert-annotated formal statement pairs derived from miniF2F and ProofNet, distinctively labeled for both semantic provability and structural likeness. Experiments on the EPLA benchmark demonstrate that TransTED Similarity surpasses existing methods, achieving state-of-the-art accuracy and Kappa score. The benchmark dataset, code, and detailed experimental results are available at https://github.com/XiaoyangLiu-sjtu/ASSESS.

ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity

TL;DR

This work presents ASSESS, a semantic-structural evaluation framework for statement similarity in autoformalization. It combines operator-tree representations with a Tree Edit Distance baseline and a novel TransTED Similarity that incorporates semantic transformations to capture equivalence beyond surface form. The EPLA benchmark, built from miniF2F and ProofNet, provides 1,247 expert-annotated pairs labeled for Provability and Likeness, enabling rigorous, graded evaluation. Experimental results show TransTED achieves state-of-the-art accuracy and Kappa on both EPLA-miniF2F and EPLA-ProofNet, outperforming string-based, proof-based, and LLM-based baselines and demonstrating robustness to thresholding. The work includes ablations and analysis of tactic commands, and releases code and benchmark data to support reproducibility and further research.

Abstract

Despite significant strides in statement autoformalization, a critical gap remains in the development of automated evaluation metrics capable of assessing formal translation quality. Existing metrics often fail to balance semantic and structural information: string-based methods neglect semantics, whereas proof-based approaches offer no graded similarity when proofs fail. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which captures syntactic structure by transforming formal statements into operator trees and computes a real-valued similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric by incorporating semantic transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a benchmark comprising 1,247 expert-annotated formal statement pairs derived from miniF2F and ProofNet, distinctively labeled for both semantic provability and structural likeness. Experiments on the EPLA benchmark demonstrate that TransTED Similarity surpasses existing methods, achieving state-of-the-art accuracy and Kappa score. The benchmark dataset, code, and detailed experimental results are available at https://github.com/XiaoyangLiu-sjtu/ASSESS.

Paper Structure

This paper contains 40 sections, 1 theorem, 10 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $X$ be an arbitrary set, $b : X \times X \to \mathbb R_{\geq 0}$ be a function and $T$ is a subset of $(X \times X)^2$. Consider the function space Then there exists a unique maximum function $\overline f \in \mathcal{F}$ such that for any $(x, y) \in X \times X$,

Figures (6)

  • Figure 1: Overview of the ASSESS Framework. (a) EPLA Benchmark: This dataset is constructed using four distinct translators to generate Formal Language (FL) pairs, followed by compilation checks and human evaluation. (b) TED Similarity: A baseline metric computed by converting FL pairs into operator trees to calculate the Tree Edit Distance (TED) similarity. (c) TransTED Similarity: A novel metric that reformulates an FL pair as an equality. It conducts a tree search guided by specific tactic commands, utilizing TED as a heuristic to minimize the edit distance. The search halts upon satisfying any of the following conditions: successful proof generation (Proved), Node Limit Exceeded (NLE), or Time Limit Exceeded (TLE).
  • Figure 2: The operator tree (OPT) of a formal statement.
  • Figure 3: Comparison of BLEU, TED Similarity, and TransTED Similarity across varying decision thresholds on EPLA-miniF2F and EPLA-ProofNet. Red stars indicate the optimal threshold configuration for each metric. Details regarding threshold selection are provided in Appendix \ref{['app:threshold_selection']}.
  • Figure 4: Statistic of the usage frequencies and adoption rates of tactic commands. (A) Usage Percentage (total): The proportion of each tactic command's successful applications out of all successful applications during the entire search process. (B) Usage Percentage (adopted): The proportion of each tactic command's adoption counts out of all adoption counts. (C) Adoption Rate: The proportion of each tactic command's adoption counts out of its total application counts.
  • Figure 5: Statistic of the tactic commands in EPLA-miniF2F
  • ...and 1 more figures

Theorems & Definitions (5)

  • Definition 1: Shortest-path distance
  • Definition 2: Tree Edit Distance
  • Definition 3: TED Similarity
  • Theorem 1
  • proof