Table of Contents
Fetching ...

ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao

TL;DR

ReForm reframes autoformalization as an iterative, reflective process that couples formal statement generation with semantic self-validation. Through Prospective Bounded Sequence Optimization, it optimizes heterogeneous rewards across sequence positions, enabling effective credit assignment for both the final correctness and the quality of intermediate critiques. Empirical results across four benchmarks show a substantial average semantic-accuracy gain (+22.6pp), supported by ConsistencyCheck, a 859-item expert-annotated reliability benchmark that highlights human fallibility yet validates frontier-model evaluation. The approach yields deeper, more thorough refinements and demonstrates robustness across diverse mathematical domains, with human evaluations corroborating semantic fidelity. Overall, ReForm advances autoformalization from single-pass translation to a scalable, reflection-enabled paradigm with reliable semantic assessment.

Abstract

Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.

ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

TL;DR

ReForm reframes autoformalization as an iterative, reflective process that couples formal statement generation with semantic self-validation. Through Prospective Bounded Sequence Optimization, it optimizes heterogeneous rewards across sequence positions, enabling effective credit assignment for both the final correctness and the quality of intermediate critiques. Empirical results across four benchmarks show a substantial average semantic-accuracy gain (+22.6pp), supported by ConsistencyCheck, a 859-item expert-annotated reliability benchmark that highlights human fallibility yet validates frontier-model evaluation. The approach yields deeper, more thorough refinements and demonstrates robustness across diverse mathematical domains, with human evaluations corroborating semantic fidelity. Overall, ReForm advances autoformalization from single-pass translation to a scalable, reflection-enabled paradigm with reliable semantic assessment.

Abstract

Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.

Paper Structure

This paper contains 33 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Autoformalization performance of ReForm against state-of-the-art models.
  • Figure 2: Overview of ReForm. (Top) Unlike traditional one-pass generation, our ReForm reconceptualizes it as a iterative process that interweaves autoformalization with semantic self-validation. (Bottom) We assign heterogeneous rewards across iterations: auxiliary rewards $r_{\text{aux}}^t$ for critique quality and task reward $r_{\text{task}}$ for final correctness. Prospective bounded returns $G_t$ computed through clipped backward accumulation enable fine-grained credit assignment for each iteration, preventing the degeneration of self-validation while improving autoformalization performance.
  • Figure 3: Training dynamics of our RL process.
  • Figure 4: Iteration Distribution of our ReForm-SFT and RL.