PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration
Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan
TL;DR
PositionOCR addresses the positional reasoning limitations of multi-modal LLMs by integrating a text spotting specialist with an LLM in a parameter-efficient hybrid architecture. Training proceeds in two stages: specialist pretraining, followed by instruction tuning with an LLM connector. The resulting model performs text grounding, text spotting, and VQA with significantly fewer trainable parameters (131M) than typical MLLMs. The approach delivers state-of-the-art results on text grounding and competitive performance on document VQA and OCR tasks across diverse datasets, highlighting the value of specialist-led positional reasoning within a multi-modal framework. This hybrid, instruction-tuned design reduces computational demands while maintaining strong cross-task generalization, offering practical benefits for OCR and document understanding applications.
Abstract
In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and adapt to varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing and thus inherently lacks the positional reasoning required for precise visual tasks such as text spotting and text grounding. Additionally, the large parameter counts of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally accurate MLLM? To answer this question, we introduce PositionOCR, a parameter-efficient hybrid architecture that integrates a text spotting model's positional strengths with an LLM's contextual reasoning. With 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, where it consistently surpasses traditional MLLMs.
