Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Athos Georgiou

Abstract

Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with a maximum delta-ANLS of 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding while retaining speech generation.
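The core mechanism -- one set of frozen base weights plus a LoRA delta that is enabled for retrieval and disabled for generation -- can be illustrated with a deliberately tiny, self-contained sketch. All names here (ToyHydra, embed, generate) are hypothetical illustrations, not the paper's actual API; the point is that disabling the adapter recovers the base model's output exactly, with no weight copying or model reloading.

```python
# Toy illustration (hypothetical names) of Hydra's inference-time LoRA toggle.
# The base weight is frozen; the LoRA update B*A is added only in retrieval
# mode, so generation mode reproduces the base model's output exactly.

class ToyHydra:
    def __init__(self, w_base, lora_a, lora_b, scale=1.0):
        self.w_base = w_base                        # frozen base weight (scalar for brevity)
        self.lora_delta = scale * lora_b * lora_a   # rank-1 LoRA update B*A
        self.lora_on = False

    def forward(self, x):
        # Effective weight is base + delta only when the adapter is enabled.
        w = self.w_base + (self.lora_delta if self.lora_on else 0.0)
        return w * x

    def embed(self, x):
        # Retrieval mode: adapter on -> adapted embedding path.
        self.lora_on = True
        return self.forward(x)

    def generate(self, x):
        # Generation mode: adapter off -> identical to the untouched base model.
        self.lora_on = False
        return self.forward(x)


m = ToyHydra(w_base=2.0, lora_a=0.5, lora_b=0.4)
base_output = 2.0 * 3.0                 # what the standalone base model would produce
assert m.generate(3.0) == base_output   # bit-exact recovery of the base model
assert m.embed(3.0) != base_output      # retrieval path uses the adapted weights
```

In the real system the same idea applies per linear layer (e.g. via a PEFT-style adapter toggle), and, as the abstract notes, weight recovery alone is not sufficient: attention mode, lm_head, and KV-cache handling must also be restored when switching back to generation.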

Paper Structure

This paper contains 49 sections, 1 equation, 2 figures, 7 tables, 1 algorithm.

Figures (2)

  • Figure 1: Hydra architecture. A single VLM serves two modes by toggling a LoRA adapter at inference time. Left: Retrieval mode (LoRA-on, bidirectional attention) produces 320-dim multi-vector embeddings via custom_text_proj. Right: Generation mode (LoRA-off, causal attention) produces autoregressive text via the base lm_head with KV-cache. The vision encoder is frozen and shared. No weight copying or model reloading occurs between modes. Solid arrows = retrieval path; dashed arrows = generation path.
  • Figure 2: RAG pipeline comparison. Top: ColPali retrieves relevant pages, but a separate LLM is needed for generation at query time---requiring two models in GPU memory (8B+ parameters, 17,913 MB peak VRAM). Bottom: Hydra uses a single 4B-parameter model for both indexing (retrieval head for embeddings) and querying (retrieval head finds top-$k$ pages, generation head answers from them). Both heads share one model in GPU memory, reducing peak VRAM to 10,496 MB (41% savings). Solid blue borders = retrieval; red borders = generation.
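The ColBERT-style late-interaction scoring used in retrieval mode (Figure 2, top-k page retrieval over multi-vector embeddings) can be sketched as follows. This is a generic MaxSim implementation under stated assumptions -- plain Python lists instead of the model's 320-dim embedding tensors, and the function name maxsim_score is illustrative, not from the paper:

```python
# Sketch of ColBERT-style late interaction (MaxSim): each query token
# embedding is matched against its best-scoring document token embedding,
# and the per-token maxima are summed into the document score.

def maxsim_score(query_vecs, doc_vecs):
    """query_vecs, doc_vecs: lists of equal-length float vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query vector, take the best dot product over all doc vectors.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]        # two query-token embeddings
doc = [[0.9, 0.1], [0.2, 0.8]]          # two document-token embeddings
# Query token 1 matches doc vector 1 (0.9); token 2 matches doc vector 2 (0.8).
assert abs(maxsim_score(query, doc) - 1.7) < 1e-9
```

Ranking pages by this score and answering from the top-k hits with the generation head is exactly the single-model query loop Figure 2 (bottom) describes.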