Simple Vision-Language Math Reasoning via Rendered Text
Matvey Skripkin, Elizaveta Goncharova, Andrey Kuznetsov
TL;DR
This work addresses the challenge of multimodal math reasoning by converting text-only math problems into rendered LaTeX images paired with structured chain-of-thought prompts. A lightweight, multi-encoder fusion approach integrates vision encoders with a small LLM via adapters, using two fusion strategies and a two-stage training pipeline (adapter pre-training followed by supervised fine-tuning). Through extensive ablations across rendering fidelity, prompt design, and fusion choices, the method achieves competitive results on standard math benchmarks and retains strong performance on general-domain vision–language tasks. The approach demonstrates that rendering fidelity and prompt engineering are key drivers of multimodal reasoning, offering a practical pathway to strong math-capable VLMs without large-scale model investment.
Abstract
We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX-encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence, with gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA.
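As a rough illustration of the text-to-vision augmentation described above, the following minimal sketch pairs a text-only math problem with a standalone LaTeX source (ready to be rendered to an image by an external tool) and a structured chain-of-thought prompt. Every name here (`LATEX_TEMPLATE`, `COT_PROMPT`, `RenderedSample`, `build_sample`), the prompt wording, and the default `dpi` and `width_cm` values are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from string import Template

# Hypothetical LaTeX wrapper; the actual rendering setup is an assumption.
LATEX_TEMPLATE = Template(r"""\documentclass[border=4pt]{standalone}
\usepackage{amsmath,amssymb}
\begin{document}
\begin{minipage}{${width_cm}cm}
$body
\end{minipage}
\end{document}
""")

# Hypothetical structured chain-of-thought prompt (wording is illustrative).
COT_PROMPT = (
    "The image shows a math problem.\n"
    "1. Restate the problem in your own words.\n"
    "2. Solve it step by step.\n"
    "3. Give the final answer as 'Answer: <value>'."
)


@dataclass(frozen=True)
class RenderedSample:
    latex_source: str  # document to compile and rasterize with an external tool
    dpi: int           # requested rendering resolution (a fidelity knob)
    prompt: str        # structured chain-of-thought instruction
    answer: str        # ground-truth answer used as the training target


def build_sample(problem_latex: str, answer: str,
                 dpi: int = 200, width_cm: int = 12) -> RenderedSample:
    """Pair a text-only problem with its render spec and a CoT prompt."""
    if dpi <= 0 or width_cm <= 0:
        raise ValueError("dpi and width_cm must be positive")
    source = LATEX_TEMPLATE.substitute(
        body=problem_latex.strip(), width_cm=width_cm
    )
    return RenderedSample(latex_source=source, dpi=dpi,
                          prompt=COT_PROMPT, answer=answer)


if __name__ == "__main__":
    sample = build_sample(r"Solve $x^2 - 4 = 0$ for $x > 0$.", "2")
    print(sample.latex_source)
    print(sample.prompt)
```

The sketch stops short of producing pixels: compiling `latex_source` and rasterizing it at `dpi` is left to whatever LaTeX toolchain is available. Keeping the resolution as an explicit field makes it straightforward to vary rendering fidelity in an ablation.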
