Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Yu Li, Jin Jiang, Jianhua Zhu, Shuai Peng, Baole Wei, Yuxuan Zhou, Liangcai Gao
TL;DR
Handwritten Mathematical Expression Recognition (HMER) remains challenging because of flexible 2D symbol layouts and variability in handwriting. The authors propose Uni-MuMER, a unified multi-task fine-tuning framework that fully fine-tunes a generalist Vision-Language Model (VLM) on HMER data without altering its architecture. It incorporates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structural reasoning, Error-Driven Learning (EDL) for reducing errors among visually confusable symbols, and Symbol Counting (SC) for consistency on long expressions. Trained on diverse datasets (CROHME, CROHME 2023, HME100K, MathWriting, Im2LaTeXv2) and optionally on external data, Uni-MuMER achieves state-of-the-art performance on CROHME and HME100K, notably in zero-shot settings, and shows strong cross-dataset generalization. The authors also study the effect of data diversity and of mixing in general-domain data, report fast inference with vLLM, and release code and models for reproducibility. The work offers a scalable, interpretable path for applying large VLMs to structured recognition tasks such as HMER, with potential impact on document understanding and the digital preservation of scientific material.
Abstract
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layouts and variability in handwriting styles. Prior methods have addressed performance bottlenecks with isolated architectural modifications, which are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves state-of-the-art performance, outperforming the best lightweight specialized model, SSAN, by 16.31% and the top-performing VLM, Gemini2.5-flash, by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
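To make the Symbol Counting (SC) task more concrete, the following is a minimal sketch of how a counting target could be derived from a LaTeX label. The abstract does not specify the tokenization or the target format, so the tokenizer regex, the set of skipped structural tokens, the output string layout, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import re
from collections import Counter

# Assumed tokenizer: a LaTeX command (\frac), an escaped character (\{),
# or any single non-whitespace character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")

# Assumption: grouping and script markers are structural, not visible symbols,
# so they are left out of the count.
STRUCTURAL = {"{", "}", "^", "_"}


def count_symbols(latex: str) -> Counter:
    """Count visible symbols in a LaTeX string (hypothetical helper)."""
    tokens = TOKEN_RE.findall(latex)
    return Counter(tok for tok in tokens if tok not in STRUCTURAL)


def build_counting_target(latex: str) -> str:
    """Format the counts as a text target (hypothetical output format)."""
    counts = count_symbols(latex)
    # Counter keeps first-seen order, so symbols appear in reading order.
    return ", ".join(f"{sym}: {n}" for sym, n in counts.items())


if __name__ == "__main__":
    label = r"\frac{a}{b}+a^{2}"
    print(build_counting_target(label))
    # prints: \frac: 1, a: 2, b: 1, +: 1, 2: 1
```

A target of this kind could be paired with the expression image as an auxiliary training sample, so that the model is asked to account for every symbol before or alongside producing the full LaTeX string; how the paper actually combines the two is not stated in the abstract.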
