Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan

TL;DR

Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains, underscoring its superior performance and generalization ability over SFT.

Abstract

Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables geometric perception to be evaluated in isolation from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: $+26.5\%$ on in-domain data, $+8.0\%$ on out-of-domain data, and $+39.0\%$ on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code is released at https://github.com/Longin-Yu/GeoPerceive to ensure reproducibility.
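
To make the idea of a DSL-level score more concrete, here is a minimal sketch. It assumes that a model's natural-language description has already been translated into a list of DSL statements and that a ground-truth list is available. The statement strings, the names `dsl_f1` and `rank_candidates`, and the choice of a multiset F1 are illustrative assumptions, not the paper's actual scoring rule or GeoDSL syntax.

```python
from __future__ import annotations

from collections import Counter


def normalize(statement: str) -> str:
    """Collapse runs of whitespace so trivially different spellings match."""
    return " ".join(statement.split())


def dsl_f1(predicted: list[str], reference: list[str]) -> float:
    """Multiset F1 between predicted and reference DSL statements.

    Assumption: statements are compared as normalized strings; a real scorer
    would likely also handle semantically equivalent statements.
    """
    pred = Counter(normalize(s) for s in predicted)
    ref = Counter(normalize(s) for s in reference)
    overlap = sum((pred & ref).values())  # statements present in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rank_candidates(
    candidates: dict[str, list[str]], reference: list[str]
) -> tuple[str, str]:
    """Return the names of the best- and worst-scoring candidates."""
    ordered = sorted(candidates, key=lambda name: dsl_f1(candidates[name], reference))
    return ordered[-1], ordered[0]


# Hypothetical DSL statements, for illustration only.
reference = ["line A B", "line B C", "circle O A"]
candidates = {
    "response_1": ["line A B", "line B C"],
    "response_2": ["line A  B", "line X Y"],
}
best, worst = rank_candidates(candidates, reference)
```

Under these assumptions, the per-candidate scores could be used directly as scalar rewards, or the best and worst candidates could form a preference pair. How GeoDPO actually consumes the scores is not specified in the abstract.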

Paper Structure (39 sections, 18 equations, 6 figures, 12 tables, 1 algorithm)

Figures (6)

  • Figure 1: Pseudo-code of common literals in existing DSLs (AlphaGeometry, Geomodelbuilder, Inter-GPS). The three DSL programs on the right are semantically equivalent, each describing the same diagram shown on the left. This one-to-many correspondence results in non-unique interpretations, thereby hindering rigorous and consistent evaluation.
  • Figure 2: Illustrations of GeoDSL syntax. A program may consist of up to four sections: points, lines, circles, and explicit constraints. Point–curve incidence relationships are automatically inferred within curve declarations.
  • Figure 3: Pipelines of GeoPerceive and GeoDPO.
  • Figure 4: Failure Cases of Solving Engine.
  • Figure 5: Randomly selected GeoPerceive diagrams. For each generation iteration, two sampled figures are displayed to illustrate the progressive increase in geometric complexity.
  • ...and 1 more figure
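
The program layout described in the Figure 2 caption (points, lines, circles, explicit constraints, with point–curve incidence inferred from curve declarations) can be sketched as a small data structure. This is a hypothetical illustration only: the class names, the fields, the presence of a circle center, and the curve identifiers such as `line0` are assumptions, and the actual GeoDSL grammar may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Line:
    points: tuple[str, ...]  # points declared as lying on this line


@dataclass
class Circle:
    center: str              # assumed field; not confirmed by the caption
    points: tuple[str, ...]  # points declared as lying on this circle


@dataclass
class Program:
    """Hypothetical container mirroring the four sections named in the caption."""

    points: list[str] = field(default_factory=list)
    lines: list[Line] = field(default_factory=list)
    circles: list[Circle] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def incidences(self) -> set[tuple[str, str]]:
        """Infer (point, curve) incidence pairs from the curve declarations."""
        pairs: set[tuple[str, str]] = set()
        for i, line in enumerate(self.lines):
            pairs |= {(p, f"line{i}") for p in line.points}
        for i, circle in enumerate(self.circles):
            pairs |= {(p, f"circle{i}") for p in circle.points}
        return pairs


# Illustrative program: one line through A and B, one circle through A and C.
program = Program(
    points=["A", "B", "C", "O"],
    lines=[Line(("A", "B"))],
    circles=[Circle("O", ("A", "C"))],
    constraints=["perpendicular A B, O C"],  # made-up constraint string
)
inferred = program.incidences()
```

Because incidence is derived from the curve declarations rather than written as separate statements, a given diagram has fewer equivalent spellings, which is consistent with the non-uniqueness concern raised in the Figure 1 caption.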