Simple Vision-Language Math Reasoning via Rendered Text

Matvey Skripkin, Elizaveta Goncharova, Andrey Kuznetsov

TL;DR

This work addresses the challenge of multimodal math reasoning by converting text-only math problems into rendered LaTeX images paired with structured chain-of-thought prompts. A lightweight, multi-encoder fusion approach integrates vision encoders with a small LLM via adapters, using two fusion strategies and a two-stage training pipeline (adapter pre-training followed by supervised fine-tuning). Through extensive ablations across rendering fidelity, prompt design, and fusion choices, the method achieves competitive results on standard math benchmarks and retains strong performance on general-domain vision–language tasks. The approach demonstrates that rendering fidelity and prompt engineering are key drivers of multimodal reasoning, offering a practical pathway to strong math-capable VLMs without large-scale model investment.

Abstract

We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX-encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks while preserving broad general-domain competence, with gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA.
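
The data-construction idea in the abstract (render the problem's LaTeX as an image, then pair it with a chain-of-thought prompt and a LaTeX solution) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration rather than the authors' implementation: the `build_sample` helper, the `Sample` record, the prompt wording, the `standalone` document template, and the choice of `pdflatex` plus `pdftoppm` as the rendering toolchain. The commands are only assembled, not executed.

```python
import hashlib
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical chain-of-thought prompt; the paper's actual wording is not shown here.
COT_PROMPT = (
    "Solve the math problem shown in the image. Reason step by step, "
    "then give the final answer on a line starting with 'Answer:'."
)

# Assumed minimal LaTeX wrapper so the problem can be compiled on its own.
TEX_TEMPLATE = r"""\documentclass[preview,border=4pt]{standalone}
\usepackage{amsmath,amssymb}
\begin{document}
<<PROBLEM>>
\end{document}
"""


@dataclass(frozen=True)
class Sample:
    image_path: str     # where the rendered problem image is expected to be written
    tex_source: str     # standalone LaTeX document containing the problem
    prompt: str         # text prompt paired with the image
    target: str         # solution kept as LaTeX text
    render_cmds: tuple  # commands that would produce the image (not run here)


def build_sample(problem_tex: str, solution_tex: str,
                 out_dir: str = "renders", dpi: int = 200) -> Sample:
    """Turn a text-only problem into an (image path, prompt, solution) record."""
    # Stable file name derived from the problem text.
    key = hashlib.sha1(problem_tex.encode("utf-8")).hexdigest()[:12]
    base = PurePosixPath(out_dir) / key
    tex_source = TEX_TEMPLATE.replace("<<PROBLEM>>", problem_tex)
    # Assumed toolchain: compile to PDF, then rasterize to a single PNG.
    # The dpi value stands in for the "rendering fidelity" knob.
    render_cmds = (
        ("pdflatex", "-interaction=nonstopmode",
         f"-output-directory={out_dir}", f"{base}.tex"),
        ("pdftoppm", "-png", "-singlefile", "-r", str(dpi),
         f"{base}.pdf", str(base)),
    )
    return Sample(f"{base}.png", tex_source, COT_PROMPT, solution_tex, render_cmds)


sample = build_sample(r"Solve for $x$: $x^2 - 4 = 0$.", r"$x = \pm 2$")
print(sample.image_path)
print(sample.prompt)
```

To actually render, one would write `sample.tex_source` to the `.tex` path and run each entry of `sample.render_cmds` (for instance with `subprocess.run`), assuming both tools are installed.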

Paper Structure

This paper contains 29 sections, 4 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Example of a rendered math problem with its solution. The first part is provided to the VLM as an image, while the solution is in LaTeX.
  • Figure 2: SFT training data distribution