Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Zining Wang, Tongkun Guan, Pei Fu, Chen Duan, Qianyi Jiang, Zhentao Guo, Shan Guo, Junfeng Luo, Wei Shen, Xiaokang Yang

TL;DR

The paper tackles document-level visual question answering with dense visual text, where hallucinations arise from weak spatial supervision. It introduces VQAMask, a dual-task pre-training approach that jointly optimizes VQA-based text parsing for semantic alignment and a Mask Generator for spatial alignment, aided by a mask acquisition pipeline that produces ground-truth masks. The large-scale MTMask6M dataset of 6M image-mask pairs supports Stage 1 alignment, followed by Stage 2 generative vision-language training to produce Marten, a training-efficient MLLM for document understanding. Empirical results show Marten outperforms OCR-free baselines across numerous document-centric benchmarks and OCRBench, validating the benefits of explicit spatial supervision in reducing hallucinations and improving text grounding in visual documents. The work offers a scalable path to robust document understanding by decoupling training-time mask supervision from inference, enabling strong performance with manageable compute.

Abstract

Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modalities in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset of 6M image-mask pairs (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at https://github.com/PriNing/Marten.
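
As a rough illustration of the dual-task idea described in the abstract, the sketch below combines a text-parsing loss with an auxiliary mask loss that applies only when a mask prediction is supplied, mirroring a mask generator that exists during training and is dropped at inference. The abstract does not specify the loss functions or their weighting, so the token cross-entropy, the per-pixel binary cross-entropy, the `mask_weight` parameter, and all function names here are assumptions chosen for illustration, not the authors' implementation.

```python
import math


def token_cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the target tokens (text-parsing term).

    probs: one probability distribution (list of floats) per output position.
    target_ids: the index of the correct token at each position.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)


def mask_bce(pred_mask, gt_mask, eps=1e-7):
    """Mean per-pixel binary cross-entropy (assumed mask-generation term)."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
            total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
            count += 1
    return total / count


def vqamask_loss(probs, target_ids, pred_mask=None, gt_mask=None, mask_weight=1.0):
    """Hypothetical joint objective: text loss plus an optional weighted mask loss.

    When no mask prediction is passed, only the text term is computed, which
    stands in for the setting where the mask generator has been discarded.
    """
    loss = token_cross_entropy(probs, target_ids)
    if pred_mask is not None and gt_mask is not None:
        loss += mask_weight * mask_bce(pred_mask, gt_mask)
    return loss


# Toy inputs: one output token over a two-word vocabulary, and a 1x2 mask.
text_only = vqamask_loss([[0.9, 0.1]], [0])
joint = vqamask_loss([[0.9, 0.1]], [0], pred_mask=[[0.8, 0.2]], gt_mask=[[1, 0]])
print(text_only, joint)
```

In this toy setup the joint loss exceeds the text-only loss by exactly the mask term, so gradients from the mask branch would add spatial supervision on top of the text objective without changing what the model computes at inference.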

Paper Structure

This paper contains 16 sections, 5 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Different pre-training paradigms of MLLMs for document understanding: (a) Visual Question Answering (VQA) paradigm that implicitly aligns the visual and language modalities at the semantic level; (b) our proposed Visual Question Answering with Mask generation (VQAMask) paradigm. Building on VQA, we introduce an additional Mask Generator during training to explicitly align visual texts and their corresponding image regions at a spatially-aware level. During the inference stage, the mask generator is discarded.
  • Figure 2: Overview of our proposed Marten architecture. The training of the model is divided into two stages: 1) VQAMask Alignment Training: the proposed vision-language alignment method, VQAMask, includes two pre-training tasks: VQA-based text parsing and mask generation. By integrating these two tasks, VQAMask not only enables the Marten model to implicitly learn the visual text within images at the semantic level but also explicitly aligns images and text at the spatially-aware level; 2) Vision-Language Generative Training: in this stage, we discard the mask generation task. A wide range of high-quality instruction data is collected to conduct VQA tasks for general document-level understanding.
  • Figure 3: Illustration of the VQAMask alignment training for document parsing question answering. We introduce a total of six tasks, which can be broadly categorized into 1) Read Full Text, Reading Partial Text within Localization, and Visual Text Grounding; 2) Transcription, which converts formulas into LaTeX, tables into markdown or LaTeX, and charts into CSV and markdown formats.
  • Figure 4: Bar chart of scores for each subtask in OCRBench. "KIE" stands for Key Information Extraction, and "HMER" stands for Handwritten Mathematical Expression Recognition.
  • Figure 5: Visualization of output results in VQAMask alignment training. We present samples for three different tasks: 1) Sample A represents full-image visual text recognition, 2) Sample B represents Markdown-style transcription, and 3) Sample C represents reading partial text guided by the bounding box.
  • ...and 4 more figures