How to Train Your Long-Context Visual Document Model

Austin Veselka

TL;DR

This is the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text performance. It also extends the known text-to-visual long-context transfer to the reverse direction.

Abstract

We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) the known text-to-visual long-context transfer also holds in reverse: visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces erroneous and low-quality examples in the benchmark.

Paper Structure (64 sections, 1 equation, 8 figures, 26 tables)

Figures (8)

  • Figure 1: Performance of our best training recipes compared with the base models we train and with the previous SOTA, Qwen3 VL 235B A22B. We set a new SOTA on this version of MMLongBenchDoc, with SFT + CPT outperforming LongPO. 'Distill' describes the answer generation pipeline. We include scores for the self-improving setting, which uses Mistral and its CPT checkpoint for answer generation with our recursive pipeline. See the appendix (label pa:main_recipes) for specific training recipes.
  • Figure 2: Overview of the scraped PDF corpus: (left) total pages by top-level category (categories are recursively refined to generate search queries); (right) distribution of number of pages per PDF.
  • Figure 3: Length distributions of training examples. (Left) CPT example length (tokens): image tokens are estimated as 1024 tokens per page; text-only samples shorter than 1024 tokens are clipped to 1024. Note that the LC text data from Prolong is very strongly skewed towards short examples. (Right) SFT example length (pages).
  • Figure 4: Distribution of number of pages per PDF in the PDFA English split.
  • Figure 5: Top subcategories by total pages within the scraped PDF corpus (grouped by parent category).
  • ...and 3 more figures
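
The length estimate described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function and constant names are hypothetical, and only the two rules stated in the caption (1024 tokens per page image, text-only samples clipped up to 1024 tokens) are taken from the source.

```python
# Hypothetical names; only the two numeric rules come from the Figure 3 caption.
TOKENS_PER_PAGE = 1024   # estimated image tokens per document page
MIN_TEXT_TOKENS = 1024   # text-only samples shorter than this are clipped up


def estimated_visual_length(num_pages: int) -> int:
    """Estimate the token length of a visual example from its page count."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * TOKENS_PER_PAGE


def clipped_text_length(text_tokens: int) -> int:
    """Return the plotted length of a text-only sample, clipped to the minimum."""
    if text_tokens < 0:
        raise ValueError("text_tokens must be non-negative")
    return max(text_tokens, MIN_TEXT_TOKENS)


if __name__ == "__main__":
    print(estimated_visual_length(10))   # 10 pages at 1024 tokens each
    print(clipped_text_length(300))      # short text sample, clipped up
    print(clipped_text_length(5000))     # long text sample, unchanged
```

Under this estimate, page count and token length are interchangeable up to a constant factor, which is presumably why the CPT panel is reported in tokens and the SFT panel in pages.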