QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding
Binh M. Le, Shaoyuan Xu, Jinmiao Fu, Zhishen Huang, Moyan Li, Yanhui Guo, Hongdong Li, Sameera Ramasinghe, Bryan Wang
TL;DR
QID addresses data-scarce fine-tuning for OCR-free Visual Document Understanding (VDU) by injecting a single query embedding into the vision encoder without modifying the core attention blocks. It uses a dual-module design: a query-aware module, trained with fuse and defuse learning, that aligns attention with query-relevant regions, and a query-agnostic module that adds a sinusoidal positional bias to stabilize visual representations. Across dense and scene-text datasets, QID yields consistent improvements over state-of-the-art baselines with modest inference overhead, indicating high data efficiency and robustness in OCR-free VDU scenarios. Because the approach preserves the encoder architecture while improving visual-semantic grounding, it is practical for document understanding tasks with limited labeled data.
Abstract
In Visual Document Understanding (VDU) tasks, fine-tuning a pre-trained Vision-Language Model (VLM) on new datasets often fails to optimize the vision encoder to identify query-specific regions in text-rich document images. Existing methods that inject queries directly into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model's focus, and a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, facilitating targeted learning of query embeddings and enhancing visual semantic identification. Experiments with OCR-free VLMs across multiple datasets demonstrate significant performance improvements with our method, especially on text-rich documents in data-scarce environments.
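To make the dual-module idea more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes the query vector is obtained by mean-pooling query token embeddings, that it is passed to the encoder as one extra token alongside the patch tokens, and that the positional bias is the standard fixed sinusoidal table from the original Transformer. The function names (`sinusoidal_bias`, `pool_query`, `build_encoder_input`) are hypothetical, and fuse and defuse learning is not modeled here.

```python
import math


def sinusoidal_bias(num_tokens, dim):
    """Fixed sinusoidal position table (standard Transformer formula).

    Assumption: stands in for the query-agnostic positional bias.
    """
    table = []
    for pos in range(num_tokens):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            # Even channels use sine, odd channels use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def pool_query(query_token_embeddings):
    """Collapse query token embeddings into a single vector.

    Assumption: mean pooling; the source does not specify the operator.
    """
    n = len(query_token_embeddings)
    dim = len(query_token_embeddings[0])
    return [sum(tok[d] for tok in query_token_embeddings) / n for d in range(dim)]


def build_encoder_input(patch_tokens, query_token_embeddings):
    """Add the positional bias to patch tokens and append one query token.

    The attention blocks themselves are untouched; only their input changes.
    """
    dim = len(patch_tokens[0])
    bias = sinusoidal_bias(len(patch_tokens), dim)
    biased = [
        [p + b for p, b in zip(patch, bias_row)]
        for patch, bias_row in zip(patch_tokens, bias)
    ]
    query_vec = pool_query(query_token_embeddings)
    return biased + [query_vec]


if __name__ == "__main__":
    patches = [[0.0] * 4 for _ in range(3)]           # 3 toy patch tokens, dim 4
    query = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # 2 toy query tokens
    tokens = build_encoder_input(patches, query)
    print(len(tokens), tokens[-1])
```

In this sketch the encoder sees one more token than there are patches, which is one way a single query embedding could influence attention while the block weights and structure stay as pre-trained.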
