Arctic-TILT. Business Document Understanding at Sub-Billion Scale

Łukasz Borchmann; Michał Pietruszka; Wojciech Jaśkowski; Dawid Jurkiewicz; Piotr Halama; Paweł Józiak; Łukasz Garncarek; Paweł Liskowski; Karolina Szyndler; Andrzej Gretkowski; Julita Ołtusek; Gabriela Nowakowska; Artur Zawłocki; Łukasz Duhr; Paweł Dyda; Michał Turski

TL;DR

Arctic-TILT achieves accuracy on par with models 1000$\times$ its size on question answering grounded in PDF or scan content. It establishes state-of-the-art results on seven diverse Document Understanding benchmarks and provides reliable confidence scores and quick inference, which are essential for processing files in large-scale or time-sensitive enterprise environments.

Abstract

A vast portion of workloads employing LLMs involves answering questions grounded in PDF or scan content. We introduce Arctic-TILT, which achieves accuracy on par with models 1000$\times$ its size on these use cases. It can be fine-tuned and deployed on a single 24GB GPU, lowering operational costs while processing Visually Rich Documents with up to 400k tokens. The model establishes state-of-the-art results on seven diverse Document Understanding benchmarks and provides reliable confidence scores and quick inference, which are essential for processing files in large-scale or time-sensitive enterprise environments.

Paper Structure (47 sections, 1 equation, 11 figures, 5 tables)

Introduction
Related Works
Arctic-TILT
Fusion of Text and Vision
Fusion by Tensor Product.
Module placement.
Long Context Support
Chunked processing.
Nested stack checkpointing.
Random chunks.
Pretraining and Finetuning
Experiments
Document Visual QA and KIE
Multi-page.
Single-page.
...and 32 more sections

Figures (11)

Figure 1: Arctic-TILT consumes long, richly formatted PDFs given a single, cost-efficient GPU and can produce their summary, answer questions, and extract values, outperforming vastly heavier LLMs and LVLMs.
Figure 2: Arctic-TILT modality fusion. It can be seen as attention with a role vector (tensor product), simplified in that we calculate it over a pair of aligned text and image tokens.
Figure 3: The Arctic-TILT encoder block combines Contextualized Vision from U-Net and Textual Semantics from input embeddings through the Fusion (F) operation. The Multi-Head Attention is augmented with 1D and 2D positional biases to capture spatial and sequential arrangement. This procedure is repeated in each layer (Nx), allowing the integrated information to be processed further.
Figure 4: Downstream finetuning with different fusion setups. Two internal benchmarks (first and second), followed by DocVQA (third) and InfographicsVQA (fourth). Y-axis is ANLS.
Figure 5: An illustration of sparse attention matrices assuming a two-layer encoder and decoder. The original TILT (A) consumes the complete input at once, in contrast to Arctic-TILT (B) with blockwise attention
...and 6 more figures
