Bridging Vision, Language, and Mathematics: Pictographic Character Reconstruction with Bézier Curves
Zihao Wan, Pau Tong Lin Xu, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu
TL;DR
This work reframes visual understanding of pictographic characters as a program synthesis problem in which each glyph is reconstructed as an executable sequence of Bézier curves. The authors present an automated Bézier-curve extraction pipeline and train a vision-language model to decompile images into geometric programs. The approach achieves higher geometric fidelity (Geometric Score $G$) than zero-shot baselines and generalizes zero-shot to Oracle Bone Script, suggesting that the model learns an abstract, transferable geometric grammar. A key contribution is explicit coordinate-axis grounding, which provides the spatial context needed for precise alignment with the ground truth. Supervised fine-tuning outperforms reinforcement learning on this task, which highlights challenges in credit assignment for fine-grained geometric outputs. These insights move visual understanding beyond pixel-level recognition toward structured, vector-based representations, with potential benefits for vector graphics, historical script decipherment, and cross-script generalization in visual reasoning.
Abstract
While Vision-Language Models (VLMs) have demonstrated strong semantic capabilities, their ability to interpret the underlying geometric structure of visual information is less explored. Pictographic characters, which combine visual form with symbolic structure, provide an ideal test case for this capability. We formulate this visual recognition challenge in the mathematical domain, where each character is represented by an executable program of geometric primitives. We frame this as a program synthesis task and train a VLM to decompile raster images into programs composed of Bézier curves. Our model, acting as a "visual decompiler", outperforms strong zero-shot baselines, including GPT-4o. Most significantly, when trained solely on modern Chinese characters, the model can reconstruct ancient Oracle Bone Script in a zero-shot setting. This generalization provides strong evidence that the model acquires an abstract and transferable geometric grammar, moving beyond pixel-level pattern recognition to a more structured form of visual understanding.
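To make the idea of a character as "an executable program of geometric primitives" concrete, the following is a minimal sketch, not the authors' actual program format. It assumes each stroke is a cubic Bézier curve given by four control points in a normalized coordinate frame; the names `bezier_point`, `sample_program`, and the two-stroke example glyph are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Assumption: one stroke = one cubic Bezier curve (start, control 1, control 2, end).
CubicBezier = Tuple[Point, Point, Point, Point]


def bezier_point(curve: CubicBezier, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    p0, p1, p2, p3 = curve
    u = 1.0 - t
    # Cubic Bernstein basis weights; they sum to 1 for any t.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_program(program: List[CubicBezier],
                   samples_per_curve: int = 16) -> List[List[Point]]:
    """'Execute' a glyph program by sampling each curve into a polyline."""
    return [
        [bezier_point(curve, i / (samples_per_curve - 1))
         for i in range(samples_per_curve)]
        for curve in program
    ]


# Hypothetical two-stroke glyph (a horizontal and a vertical stroke),
# expressed in an assumed unit-square coordinate frame.
program: List[CubicBezier] = [
    ((0.1, 0.5), (0.4, 0.5), (0.6, 0.5), (0.9, 0.5)),
    ((0.5, 0.9), (0.5, 0.6), (0.5, 0.4), (0.5, 0.1)),
]

polylines = sample_program(program)
```

Under this reading, the "decompilation" task is the inverse direction: given a raster image of the glyph, predict a list like `program` whose sampled polylines reproduce the strokes. The polylines could then be rasterized or compared against reference curves, though the specific metric used in the paper is not reproduced here.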
