Jina-VLM: Small Multilingual Vision Language Model

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao

TL;DR

Jina-VLM is a 2.4B-parameter multilingual vision-language model that achieves state-of-the-art multilingual VQA among open 2B-scale VLMs by combining a SigLIP2 vision encoder with a Qwen3 decoder via an attention-pooling connector. A two-stage training pipeline (alignment with multilingual data, followed by instruction fine-tuning) mitigates language degradation and preserves text-only capabilities. The model achieves leading results on the multilingual benchmarks MMMB and Multilingual MMBench, strong performance on general VQA, and competitive results in mathematical and real-world reasoning. It also introduces an efficient resolution-agnostic tiling strategy that reduces visual tokens by 4×. The weights and code are openly released, giving researchers and practitioners with limited compute a practical option. Limitations include tile-induced overhead and partial loss of global context, which point to future work on high-resolution processing and scaling to larger models.

Abstract

We present Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.

Paper Structure

This paper contains 20 sections, 4 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Architecture of Jina-VLM (https://huggingface.co/jinaai/jina-vlm). Images are resized to fit a grid of up to 12 overlapping tiles, plus a global thumbnail. Each tile is a square 378×378 crop; adjacent tiles overlap by 112 pixels with a stride of 266 pixels between tile origins. A 4×3 grid therefore spans 1176×910 pixels, and images exceeding this effective resolution are downscaled to fit the tile budget. Each tile produces 729 patches via SigLIP2. The VL connector concatenates features from layers 24 and 18, the third- and ninth-to-last layers, then applies 2×2 attention pooling to reduce 729 tokens to 182 before projecting to the decoder dimension. Visual tokens are combined with text embeddings for the Qwen3 decoder.
  • Figure 2: Answer questions given web documents.
  • Figure 3: Financial table requiring numerical reasoning over text.
  • Figure 4: Document image with question about textual fields.
  • Figure 5: Photo with textual question needing OCR reading.
  • ...and 6 more figures
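
The tile arithmetic in the Figure 1 caption can be checked with a few lines of Python. This is only a sketch of the geometry, not the released preprocessing code: the function names `grid_span` and `visual_tokens` are hypothetical, and the token count assumes the global thumbnail yields the same 182 tokens as a regular tile and ignores any separator or special tokens the model may add.

```python
# Values taken from the Figure 1 caption.
TILE = 378              # tile side length in pixels
OVERLAP = 112           # overlap between adjacent tiles in pixels
STRIDE = TILE - OVERLAP # distance between tile origins: 266 pixels
PATCHES_PER_TILE = 729  # patches produced by the vision encoder per tile
TOKENS_PER_TILE = 182   # tokens per tile after 2x2 attention pooling


def grid_span(cols, rows, tile=TILE, stride=STRIDE):
    """Pixel extent (width, height) covered by a cols x rows grid of overlapping tiles."""
    width = (cols - 1) * stride + tile
    height = (rows - 1) * stride + tile
    return width, height


def visual_tokens(num_tiles, tokens_per_tile=TOKENS_PER_TILE, thumbnail=True):
    """Visual token count for num_tiles tiles.

    Assumption (not stated in the caption): the thumbnail contributes the same
    number of tokens as a tile, and no extra special tokens are counted.
    """
    crops = num_tiles + (1 if thumbnail else 0)
    return crops * tokens_per_tile


print(grid_span(4, 3))                      # (1176, 910)
print(visual_tokens(12))                    # 2366 tokens with pooling
print(visual_tokens(12, PATCHES_PER_TILE))  # 9477 tokens without pooling
```

Under these assumptions, a full 12-tile image plus thumbnail costs 2366 visual tokens instead of 9477, which matches the roughly 4× reduction quoted in the TL;DR.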