Table of Contents
Fetching ...

Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry

Shiven Sinha, Ameya Prabhu, Ponnurangam Kumaraguru, Siddharth Bhat, Matthias Bethge

TL;DR

This work reevaluates Wu's method for IMO geometry on the IMO-AG-30 benchmark and compares it with AlphaGeometry's neuro-symbolic approach. It demonstrates that Wu's method can solve 15 problems, and that a combined Wu+DD+AR pipeline solves 21/30 on CPU with a 5-minute limit, approaching AlphaGeometry's performance. Moreover, combining Wu with AlphaGeometry yields 27/30, surpassing gold-medalist performance and establishing a fully symbolic baseline that rivals silver medals. The results highlight the complementary strengths of algebraic and synthetic methods and motivate broader benchmarks and continued software improvements for automated geometric reasoning.

Abstract

Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 methods by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.

Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry

TL;DR

This work reevaluates Wu's method for IMO geometry on the IMO-AG-30 benchmark and compares it with AlphaGeometry's neuro-symbolic approach. It demonstrates that Wu's method can solve 15 problems, and that a combined Wu+DD+AR pipeline solves 21/30 on CPU with a 5-minute limit, approaching AlphaGeometry's performance. Moreover, combining Wu with AlphaGeometry yields 27/30, surpassing gold-medalist performance and establishing a fully symbolic baseline that rivals silver medals. The results highlight the complementary strengths of algebraic and synthetic methods and motivate broader benchmarks and continued software improvements for automated geometric reasoning.

Abstract

Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 methods by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.
Paper Structure (5 sections, 3 figures)

This paper contains 5 sections, 3 figures.

Figures (3)

  • Figure 1: A) Performance across symbolic and LLM-Augmented methods on the IMO-AG-30 problem set, along with human performance. We set a strong baseline among symbolic systems at the standard of a silver medalist and outperform a gold medalist by a margin of one problem. Right: Diagram showing how the different methods overlap or complement each other on the IMO-AG-30 problems.
  • Figure 2: Left: Performance across symbolic and LLM-Augmented methods on the IMO-AG-30 problem set, along with human performance. We set a strong baseline among symbolic systems at the standard of a silver medalist and outperform a gold medalist by a margin of one problem. Right: Diagram showing how the different methods overlap or complement each other on the IMO-AG-30 problems.
  • Figure 3: Problem 2008-P1B JGEX (Above) and 2021-P3 (Below) with Input (Left) and Generated Solution (Right) for Wu's method. This illustration can be reproduced by opening the .gex files provided alongside on the HuggingFace repository and pressing Run.