ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
Xiaoyang Liu, Tao Zhu, Zineng Dong, Yuntian Liu, Qingfeng Guo, Zhaoxuan Liu, Yu Chen, Tao Luo
TL;DR
This work presents ASSESS, a semantic-structural evaluation framework for statement similarity in autoformalization. It combines operator-tree representations with a Tree Edit Distance baseline and a novel TransTED Similarity that incorporates semantic transformations to capture equivalence beyond surface form. The EPLA benchmark, built from miniF2F and ProofNet, provides 1,247 expert-annotated pairs labeled for Provability and Likeness, enabling rigorous, graded evaluation. Experimental results show TransTED achieves state-of-the-art accuracy and Kappa on both EPLA-miniF2F and EPLA-ProofNet, outperforming string-based, proof-based, and LLM-based baselines and demonstrating robustness to thresholding. The work includes ablations and analysis of tactic commands, and releases code and benchmark data to support reproducibility and further research.
Abstract
Despite significant strides in statement autoformalization, a critical gap remains in the development of automated evaluation metrics capable of assessing formal translation quality. Existing metrics often fail to balance semantic and structural information: string-based methods neglect semantics, whereas proof-based approaches offer no graded similarity when proofs fail. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which captures syntactic structure by transforming formal statements into operator trees and computes a real-valued similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric by incorporating semantic transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a benchmark comprising 1,247 expert-annotated formal statement pairs derived from miniF2F and ProofNet, distinctively labeled for both semantic provability and structural likeness. Experiments on the EPLA benchmark demonstrate that TransTED Similarity surpasses existing methods, achieving state-of-the-art accuracy and Kappa score. The benchmark dataset, code, and detailed experimental results are available at https://github.com/XiaoyangLiu-sjtu/ASSESS.
