PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan

TL;DR

PositionOCR addresses the positional reasoning limitations of multi-modal LLMs by integrating a text spotting specialist with an LLM in a parameter-efficient hybrid architecture. Trained in two stages (specialist pretraining, then instruction tuning with an LLM connector), it performs text grounding, text spotting, and VQA with significantly fewer trainable parameters (131M) than typical MLLMs. The approach delivers state-of-the-art results on text grounding and competitive performance on document VQA and OCR tasks across diverse datasets, highlighting the value of specialist-led positional reasoning within a multi-modal framework. The hybrid, instruction-tuned design reduces computational demands while maintaining strong cross-task generalization, which is a practical benefit for OCR and document understanding applications.

Abstract

In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally-accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, consistently surpassing traditional MLLMs.
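To make the two positional tasks named in the abstract concrete, the following is a minimal sketch of their input/output shapes only. It is not PositionOCR's model or code: `SpottedWord`, `ground_phrase`, the box format, and the sample values are all hypothetical, and exact string matching stands in for whatever the learned system does. Text spotting is taken here to produce words with bounding boxes; text grounding is taken to map a query string to the box of the matching text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class SpottedWord:
    """One text spotting result: a recognized word and its position (hypothetical type)."""
    text: str
    box: Box


def ground_phrase(words: List[SpottedWord], query: str) -> Optional[Box]:
    """Return the union box of the first run of consecutive words equal to `query`.

    Matching is case-insensitive and exact; returns None if nothing matches.
    """
    targets = query.lower().split()
    if not targets:
        return None
    n = len(targets)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if [w.text.lower() for w in window] == targets:
            # Smallest axis-aligned box that covers every matched word.
            return (
                min(w.box[0] for w in window),
                min(w.box[1] for w in window),
                max(w.box[2] for w in window),
                max(w.box[3] for w in window),
            )
    return None


# Invented spotting output for a single image, for illustration only.
spotted = [
    SpottedWord("Total", (10, 40, 60, 55)),
    SpottedWord("Amount", (65, 40, 130, 55)),
    SpottedWord("$42.00", (200, 40, 260, 55)),
]

print(ground_phrase(spotted, "total amount"))  # (10, 40, 130, 55)
```

The sketch shows why both tasks depend on accurate coordinates: grounding a phrase is only as precise as the boxes the spotting step returns.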

Paper Structure (18 sections, 2 equations, 3 figures, 8 tables)

This paper contains 18 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: (a) The specialist model for text spotting, which consists of an image encoder and a decoder. It can only output the text in the entire image along with the corresponding positions. (b) Mainstream MLLMs, where the image encoder extracts visual features and the LLM completes various multi-modal tasks. (c) Our proposed method, PositionOCR, which employs a text spotting model guided by the LLM to accomplish various multi-modal tasks.
  • Figure 2: Overall framework of PositionOCR. In the first stage, a specialist model is developed, proficient in performing detection and recognition tasks. In the second stage, a Large Language Model (LLM) is introduced to achieve alignment between the two components using data from text spotting, followed by instruction tuning to enable interaction through human language.
  • Figure 3: PositionOCR's visualization results at different text granularities: "Word", "Phrase", "Line", and "Block".