NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Zhuchenyang Liu, Yao Zhang, Yu Xiao

Abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query--document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1\% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.

Paper Structure (52 sections, 4 equations, 3 figures, 16 tables)

Figures (3)

  • Figure 1: Motivation and deployment advantage of NanoVDR. (a) Symmetric vs. asymmetric retrieval: Current VDR systems (top) use the same heavy VLM encoder (2B) for both offline document indexing and online query encoding ($>$2,000 ms per query). NanoVDR (bottom) decouples the two: the frozen VLM teacher encodes documents offline, while a distilled text-only student (70M) encodes queries online in $\sim$50 ms on CPU. (b) Performance vs. latency: On the ViDoRe benchmark (mean NDCG@5 across v1/v2/v3), NanoVDR models achieve near-teacher accuracy (dashed line) at 50--143$\times$ lower CPU latency. Bubble size is proportional to model parameter count; the star ($\bigstar$) marks NanoVDR-S-Multi, the multilingual-augmented variant (§\ref{sec:multilingual-augment}).
  • Figure 2: Query-centric distillation training of NanoVDR. Left: The frozen VLM teacher pre-caches training query embeddings via text-only inference. Right: The student text encoder is trained to minimize $\mathcal{L}_\text{align} = 1 - \cos(\mathbf{v}^Q_t, \mathbf{v}^Q_s)$, requiring no document images or negative sampling.
  • Figure 3: Data efficiency of NanoVDR-S. NDCG@5 vs. fraction of 711K training pairs, with teacher upper bounds (dashed). Percentages indicate retention (student/teacher). Diminishing returns are pronounced after 25% of training data.
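
The alignment objective in the Figure 2 caption, $\mathcal{L}_\text{align} = 1 - \cos(\mathbf{v}^Q_t, \mathbf{v}^Q_s)$, and the asymmetric lookup in Figure 1(a) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the batch-mean reduction, the use of cosine similarity for scoring documents, and the assumption that student and teacher embeddings already share one dimensionality are all illustrative choices that the text above does not specify.

```python
import numpy as np


def cosine_alignment_loss(teacher_q, student_q, eps=1e-12):
    """Mean of 1 - cos(v_t, v_s) over a batch of query embeddings.

    Both arrays have shape (batch, dim). Averaging over the batch is an
    assumption; the caption only gives the per-query loss.
    """
    t = teacher_q / np.maximum(np.linalg.norm(teacher_q, axis=1, keepdims=True), eps)
    s = student_q / np.maximum(np.linalg.norm(student_q, axis=1, keepdims=True), eps)
    return float(np.mean(1.0 - np.sum(t * s, axis=1)))


def rank_documents(student_query, teacher_doc_index, top_k=5):
    """Rank pre-computed teacher document embeddings against one student query.

    Cosine similarity as the scoring function is an assumption of this sketch.
    Returns the indices of the top_k highest-scoring documents.
    """
    q = student_query / max(np.linalg.norm(student_query), 1e-12)
    d = teacher_doc_index / np.maximum(
        np.linalg.norm(teacher_doc_index, axis=1, keepdims=True), 1e-12
    )
    scores = d @ q
    return np.argsort(-scores)[:top_k]
```

Only cached teacher query embeddings enter the loss, so a training step in this sketch touches no document images and no negatives, which matches the caption. At query time the student embedding is compared directly against the index that the teacher built offline.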