Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han

TL;DR

A novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs), which employs multi-scale visual features to effectively handle various font sizes within document images and introduces a novel instruction tuning task, which facilitates the model's text-reading capability.

Abstract

We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for MLLMs, we propose the Hierarchical Visual Feature Aggregation (HVFA) module, designed to reduce the number of input tokens to LLMs. Leveraging a feature pyramid with cross-attentive pooling, our approach effectively manages the trade-off between information loss and efficiency without being affected by varying document image sizes. Furthermore, we introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text, eventually minimizing the risk of truncated text caused by the limited capacity of LLMs. Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks.

Paper Structure

This paper contains 47 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Illustration of the proposed framework. Our model adopts visual features from multiple scales, which are aggregated through the Hierarchical Visual Feature Aggregation (HVFA) module. The aggregated features are then fed into an LLM to generate a language response in an autoregressive manner. The sub-image highlighted by the red box contains the text relevant to an input question, which requires accurately recognizing visually detailed elements from a high-resolution image.
  • Figure 2: Illustration of the Hierarchical Visual Feature Aggregation (HVFA) module. (Left) HVFA aggregates high-resolution visual features into low-resolution features by leveraging a feature pyramid structure. (Right) In cross-attentive pooling, each sub-image feature attends to all of the fine-grained visual features, compressing them while preserving more detailed information.
  • Figure 3: Performance analysis on visual and textual inputs. (Left) Impact of visual input scale on model performance. We compare four variants of our model: with the first scale ($S_1$), with the second scale ($S_2$), with multiple scales ($S_1+S_2$), and with multiple scales with HVFA ($S_1+S_2$ w/ HVFA, ours). (Right) Impact of truncated text in the text reading task on model performance when varying the sequence-length capacity of the LLM.
  • Figure 4: Our method vs. UReader (Ye et al., 2023) on DocVQA (Mathew et al., 2021) and InfographicVQA (Mathew et al., 2022).
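
The cross-attentive pooling described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, treats the coarse (sub-image) features as queries and the fine-grained features as keys and values, and adds the result back to the queries as a residual. The function names `softmax`, `dot`, and `cross_attentive_pool` are hypothetical, and the inputs are random toy vectors.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attentive_pool(coarse, fine):
    """Compress `fine` tokens into len(coarse) tokens (illustrative only).

    coarse: list of n_coarse query vectors of dimension d (assumed layout).
    fine:   list of n_fine key/value vectors of dimension d (assumed layout).
    Returns (pooled, weights); pooled has n_coarse vectors of dimension d.
    """
    d = len(coarse[0])
    pooled, all_weights = [], []
    for q in coarse:
        # Each coarse token scores every fine-grained token.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in fine])
        # Attention-weighted summary of the fine-grained tokens.
        summary = [sum(w * v[j] for w, v in zip(weights, fine)) for j in range(d)]
        # Residual connection (an assumption of this sketch).
        pooled.append([qi + si for qi, si in zip(q, summary)])
        all_weights.append(weights)
    return pooled, all_weights


random.seed(0)
coarse = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
fine = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
out, weights = cross_attentive_pool(coarse, fine)
print(len(out), len(out[0]))  # token count follows the coarse grid, not the fine one
```

The point of the sketch is the shape change: the output has as many tokens as the coarse input, so the number of tokens passed on does not grow with the number of fine-grained features.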