UNIT: Unifying Image and Text Recognition in One Vision Encoder

Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

TL;DR

UNIT, a novel training framework for UNifying Image and Text recognition within a single model, significantly outperforms existing methods on document-related tasks while maintaining performance on natural images, showing that it can substantially enhance text recognition without compromising core image recognition capabilities.

Abstract

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition the way human vision does. To address this limitation, we propose UNIT, a novel training framework aimed at UNifying Image and Text recognition within a single model. Starting with a vision encoder pre-trained on image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities. The training process comprises two stages: intra-scale pretraining and inter-scale finetuning. During intra-scale pretraining, UNIT learns unified representations from multi-scale inputs, where images and documents are each at their commonly used resolution, to enable fundamental recognition capability. In the inter-scale finetuning stage, the model introduces scale-exchanged data, featuring images and documents at resolutions different from the most commonly used ones, to enhance its scale robustness. Notably, UNIT retains the original vision encoder architecture, so it adds no cost at inference or deployment. Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and DocQA) while maintaining performance on natural images, demonstrating its ability to substantially enhance text recognition without compromising its core image recognition capabilities.
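
The two-stage schedule described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the resolution values, the stage identifiers, and the function `resolutions_for` are placeholders for illustration, not the authors' code. The sketch also assumes that the finetuning stage keeps native-scale data alongside the scale-exchanged data, which the abstract suggests but does not state outright.

```python
from typing import List

# Placeholder resolutions (assumptions): the abstract only says that images
# and documents each have a "commonly used" resolution, without giving values.
LOW_RES = 224   # assumed side length for natural images
HIGH_RES = 896  # assumed side length for documents


def resolutions_for(stage: str, kind: str) -> List[int]:
    """Return the input resolutions at which a data kind is seen in a stage.

    stage: "intra_scale_pretraining" or "inter_scale_finetuning" (hypothetical names)
    kind:  "image" or "document"
    """
    if kind not in ("image", "document"):
        raise ValueError(f"unknown data kind: {kind!r}")

    # Native scale: low resolution for images, high resolution for documents.
    native = LOW_RES if kind == "image" else HIGH_RES
    # Exchanged scale: the resolution normally used by the other data kind.
    exchanged = HIGH_RES if kind == "image" else LOW_RES

    if stage == "intra_scale_pretraining":
        # Each data kind is seen only at its commonly used resolution.
        return [native]
    if stage == "inter_scale_finetuning":
        # Scale-exchanged data is introduced; native data is assumed to remain.
        return [native, exchanged]
    raise ValueError(f"unknown stage: {stage!r}")


if __name__ == "__main__":
    for stage in ("intra_scale_pretraining", "inter_scale_finetuning"):
        for kind in ("image", "document"):
            print(stage, kind, resolutions_for(stage, kind))
```

Under these assumptions, documents are seen only at high resolution during pretraining, and low-resolution documents and high-resolution images appear only in the finetuning stage.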

Paper Structure (14 sections, 6 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Overview of UNIT Architecture. The model processes high-resolution documents and low-resolution images, generating a set of visual tokens. These tokens pass through an input embedding layer, with document tokens fed into the language decoder to predict text sequences, enhancing the model's text recognition capability. Simultaneously, to preserve the model's original image encoding ability, the visual tokens from natural images are reconstructed via a lightweight vision decoder, mimicking the output of the teacher model. Additionally, an image captioning task is included alongside the OCR task to further enhance image understanding.
  • Figure 2: Illustration of the UNIT training paradigm. (a) The intra-scale pretraining stage processes images and documents at their commonly used resolutions to integrate basic text recognition with existing image recognition capabilities. (b) The inter-scale finetuning stage processes scale-exchanged data and tasks to enhance scale robustness, which benefits downstream document analysis tasks when the encoder is integrated into (c) LVLM applications.
  • Figure 3: Visualization examples of text recognition. UNIT predicts accurate OCR results across diverse scenarios, e.g., handwritten text, receipts, and interleaved image-text documents. Zoom in for detail. More examples are shown in the supplementary material.
  • Figure 4: Visualization examples of downstream document analysis tasks. UNIT accurately recognizes tiny words and digits, providing correct answers for document-related questions from users.
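
The two training signals named in the Figure 1 caption, text prediction by the language decoder and reconstruction of teacher features by the vision decoder, can be sketched as a combined objective. This is a sketch under stated assumptions, not the paper's formulation: mean squared error as the reconstruction distance, the `distill_weight` parameter, and all function names are hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean token-level cross-entropy, standing in for the text prediction loss."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the vocabulary dimension.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(targets)


def feature_mse(student: Sequence[Sequence[float]],
                teacher: Sequence[Sequence[float]]) -> float:
    """Mean squared error between reconstructed tokens and teacher tokens.

    MSE is an assumption; the source only says the vision decoder mimics
    the output of the teacher model.
    """
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student, teacher):
        for s, t in zip(s_tok, t_tok):
            total += (s - t) ** 2
            count += 1
    return total / count


def unit_style_loss(text_logits, text_targets, student_feats, teacher_feats,
                    distill_weight: float = 1.0) -> float:
    """Text loss plus weighted feature reconstruction loss (hypothetical weighting)."""
    text_loss = cross_entropy(text_logits, text_targets)
    recon_loss = feature_mse(student_feats, teacher_feats)
    return text_loss + distill_weight * recon_loss


if __name__ == "__main__":
    loss = unit_style_loss(
        text_logits=[[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]],
        text_targets=[0, 2],
        student_feats=[[0.9, 0.1], [0.2, 0.8]],
        teacher_feats=[[1.0, 0.0], [0.0, 1.0]],
        distill_weight=0.5,
    )
    print(f"combined loss: {loss:.4f}")
```

The reconstruction term is what discourages the encoder from drifting away from the teacher's image features while the text term teaches it to read; how the paper balances the two is not stated in this section.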