QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding

Binh M. Le, Shaoyuan Xu, Jinmiao Fu, Zhishen Huang, Moyan Li, Yanhui Guo, Hongdong Li, Sameera Ramasinghe, Bryan Wang

TL;DR

QID tackles the challenge of data-scarce fine-tuning for OCR-free Visual Document Understanding by injecting a single query embedding into the vision encoder without modifying the core attention blocks. It employs a dual-module design: a query-aware module with fuse and defuse learning to align attention with query-relevant regions, and a query-agnostic module that adds a sinusoidal positional bias to stabilize visual representations. Across dense and scene-text datasets, QID yields consistent improvements over SoTA baselines with very modest inference overhead, demonstrating high data efficiency and robustness in OCR-free VDU scenarios. The approach preserves architectural integrity while enhancing visual-semantic grounding, offering practical benefits for document understanding tasks with limited labeled data.

Abstract

In Visual Document Understanding (VDU) tasks, fine-tuning a pre-trained Vision-Language Model (VLM) with new datasets often falls short in optimizing the vision encoder to identify query-specific regions in text-rich document images. Existing methods that directly inject queries into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model's focus, as well as a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, facilitating targeted learning of query embeddings and enhancing visual semantic identification. Experiments with OCR-free VLMs across multiple datasets demonstrate significant performance improvements using our method, especially in handling text-rich documents in data-scarce environments.
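The dual-module idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the single query vector is fused by plain addition to every visual token, and it uses the standard Transformer sinusoidal table as a stand-in for the query-agnostic positional bias. The names `sinusoidal_bias`, `inject`, and `alpha` are hypothetical, and the query vector is a fixed placeholder here, whereas in QID it is learned.

```python
import math

def sinusoidal_bias(num_tokens: int, dim: int) -> list[list[float]]:
    """Standard sinusoidal position table; query-independent, so it can be precomputed."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_tokens):
        row = []
        for i in range(0, dim, 2):  # i is the even channel index
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

def inject(tokens, query_vec, bias, alpha=1.0):
    """Add one shared query vector (scaled by alpha) and a per-position bias to each token.

    Assumption: additive fusion outside the attention block; the paper's
    actual fuse/defuse steps are not reproduced here.
    """
    return [
        [t + alpha * q + b for t, q, b in zip(tok, query_vec, b_row)]
        for tok, b_row in zip(tokens, bias)
    ]

# Placeholder inputs: 3 visual tokens of width 4, one query vector.
tokens = [[0.0] * 4 for _ in range(3)]
query_vec = [0.5, -0.5, 0.25, 0.0]
bias = sinusoidal_bias(num_tokens=3, dim=4)
out = inject(tokens, query_vec, bias)
```

Because the bias table depends only on token position and width, it can be computed once and reused for every query, which is consistent with the claim that the query-agnostic term can be stored as a bias after training.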

Paper Structure

This paper contains 20 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of our approach. Top: Unlike previous work (e.g., QA-ViT ganz2024questionaware), our method decouples the query-informed module from the attention block and decomposes it into a query-aware module that learns a single embedding vector and a query-agnostic module, thereby reducing computational demands during training and inference. Bottom: Comparative results of our proposed method on two dense-text and one scene-text image datasets versus baseline methods, applying fine-tuning to the Qwen-VL-Chat model with only 1,000 samples per dataset.
  • Figure 2: Illustration of our end-to-end training procedure. For simplicity, this figure demonstrates how our approach is integrated with the last ViT attention block; it can also be applied to other layers of a vision encoder. During the fine-tuning stage, only the green modules are optimized. Our query-aware module, enhanced by the fuse and defuse learning steps, makes the query embedding more robust for the vision encoder. Our query-agnostic module offsets distribution shifts caused by the query information and, because it operates independently of the query vector, can be precomputed and saved as a bias term after training. This efficient learning approach on a single query vector makes our proposed method lightweight and highly effective for VLMs in VDU tasks.
  • Figure 3: Effect of the [EoS] token in the query for highlighting semantic areas in the image li2023clip. The green numbers at the bottom indicate the top-3 cosine similarities between tokens and the image, as computed using CLIP embeddings radford2021learning.
  • Figure 4: Qualitative comparison of QA-ViT and our QID. Crucial regions are enlarged for better visualization. More visualizations are provided in the Supp. Material (Section \ref{supp:Visualization}).
  • Figure 5: More qualitative results comparing QA-ViT and our QID with the Qwen-VL-Chat model. Image regions with answers are highlighted.
  • ...and 1 more figure
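
The Figure 3 caption refers to top-3 cosine similarities between query tokens and the image. A rough sketch of that ranking step follows, assuming the token and image embeddings are already available as plain vectors (in the paper they come from CLIP; the toy vectors and the helper names `cosine` and `top_k_tokens` below are hypothetical).

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_tokens(token_embs, image_emb, k=3):
    """Return (token_index, similarity) pairs for the k most similar tokens."""
    scored = [(cosine(e, image_emb), i) for i, e in enumerate(token_embs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(i, s) for s, i in scored[:k]]

# Toy embeddings standing in for CLIP outputs.
image_emb = [1.0, 0.0]
token_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = top_k_tokens(token_embs, image_emb, k=3)
```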