SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, Tong Sun

TL;DR

SV-RAG tackles long, visually-rich document understanding by pairing a multimodal retriever with a QA model, implemented as two LoRA adapters on a shared MLLM backbone. It introduces Col-Retrieval with contextualized late interaction, and the dual-adapter design keeps memory usage low while maintaining high accuracy. The VisR-Bench dataset provides targeted evaluation for figure-rich, multi-page documents, and SV-RAG achieves state-of-the-art or competitive results on multiple public benchmarks, including MMLongBench-Doc, SlideVQA, DocVQA, and DUDE, with strong efficiency advantages over processing all pages. This work demonstrates that lightweight, edge-friendly MLLMs can handle multi-page QA tasks effectively, enabling practical deployment in resource-constrained environments.

Abstract

Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to an MLLM is inefficient, especially for lengthy documents. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which can extend any MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can serve as effective multimodal retrievers** that fetch relevant pages and then answer user questions based on these pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.
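The retrieval step can be illustrated with a generic late-interaction (MaxSim) scorer of the kind popularized by ColBERT-style retrievers. This is a minimal sketch, not the paper's implementation: it assumes each query and each page has already been encoded into a matrix of per-token embeddings, and the function names `late_interaction_score` and `retrieve_top_k` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def late_interaction_score(query_emb, page_emb):
    """MaxSim score between one query and one page.

    query_emb: (n_query_tokens, dim) array of token embeddings (assumed given).
    page_emb:  (n_page_tokens, dim) array of token embeddings (assumed given).
    """
    q = l2_normalize(query_emb)
    p = l2_normalize(page_emb)
    sim = q @ p.T                      # (n_query_tokens, n_page_tokens)
    # For each query token keep its best-matching page token, then sum.
    return float(sim.max(axis=1).sum())

def retrieve_top_k(query_emb, page_embs, k=1):
    """Rank pages by late-interaction score and return the top-k indices."""
    scores = [late_interaction_score(query_emb, p) for p in page_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores

# Toy data: two query tokens and two candidate pages.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
pages = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # matches both query tokens exactly
    np.array([[1.0, 1.0]]),              # partial match to each query token
]
top_pages, page_scores = retrieve_top_k(query, pages, k=1)
```

In a pipeline like the one described, only the pages returned by such a ranking would be passed to the question-answering model, which is what avoids feeding every page to the MLLM.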

Paper Structure

This paper contains 43 sections, 3 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of the SV-RAG pipeline. The multi-page document and query are encoded by a customized MLLM (yellow). The most relevant page is retrieved through similarity-based matching, and a fine-tuned MLLM (blue) generates the final answer from the evidence.
  • Figure 2: Model overview of SV-RAG. It contains two modules, both fine-tuned using LoRA (Hu et al., 2021) and sharing the same pretrained multimodal LLM backbone. The retrieval module selects evidence pages for the QA module, which provides responses to user questions.
  • Figure 3: Distribution of document types (left) and average document length for each type (right).
  • Figure 4: Top-1 retrieval accuracy on MMLongBench-Doc using different hidden states across all layers of Phi-3-vision.
  • Figure A.1: Example of training pairs within a batch (batch size: 4) for contrastive training, using samples from the SlideVQA dataset.
  • ...and 5 more figures
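The shared-backbone design in Figure 2 can be sketched with a single linear layer. The example below is an illustration under stated assumptions, not the authors' code: it uses plain NumPy, a toy 8x8 weight, rank 2, and a simple `scale` factor in place of the usual alpha/rank scaling. The adapter names `retrieval` and `qa` and the helpers `make_adapter`, `forward`, and `param_counts` are hypothetical. The point is that both adapters reuse one frozen weight matrix, so adding the second module costs only a small number of extra parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2            # toy sizes chosen for illustration

# Frozen backbone weight, shared by both modules.
W = rng.normal(size=(d_out, d_in))

def make_adapter(rng, d_in, d_out, rank):
    """LoRA factors: A is small random, B starts at zero (standard LoRA init)."""
    return {
        "A": rng.normal(scale=0.01, size=(rank, d_in)),
        "B": np.zeros((d_out, rank)),
    }

# Two independent low-rank adapters on the same backbone weight.
adapters = {
    "retrieval": make_adapter(rng, d_in, d_out, rank),
    "qa": make_adapter(rng, d_in, d_out, rank),
}

def forward(x, adapter_name=None, scale=1.0):
    """Compute W x, plus the low-rank update B (A x) of the selected adapter."""
    y = W @ x
    if adapter_name is not None:
        a = adapters[adapter_name]
        y = y + scale * (a["B"] @ (a["A"] @ x))
    return y

def param_counts():
    """Return (shared backbone parameters, parameters per adapter)."""
    per_adapter = sum(v.size for v in adapters["qa"].values())
    return W.size, per_adapter

x = np.ones(d_in)
base_out = forward(x)
qa_out_at_init = forward(x, "qa")       # equals base_out because B is zero

# Simulate training only the QA adapter; the retrieval adapter is untouched.
adapters["qa"]["B"] = np.ones((d_out, rank))
qa_out_trained = forward(x, "qa")
retrieval_out = forward(x, "retrieval")
shared_params, adapter_params = param_counts()
```

Switching between retrieval and question answering then amounts to selecting which adapter to apply, while the backbone weights are loaded once.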