PALO: A Polyglot Large Multimodal Model for 5B People

Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan

TL;DR

Palo addresses the scarcity of multilingual vision-language capabilities by introducing a polyglot Large Multimodal Model that operates across ten major languages. It uses a semi-automated translation pipeline to create a large multilingual instruction-tuning dataset and trains three scalable variants (1.7B mobile, 7B, and 13B) with Vicuna/MobileLLaMA backbones and a CLIP-based vision encoder. Evaluation on a translated multilingual benchmark demonstrates substantial gains in both high-resource and underrepresented languages, underscoring the method's ability to improve cross-lingual visual reasoning while maintaining performance in English and other high-resource languages. The work highlights the importance of open-source multilingual VLMs and provides a replicable pipeline for expanding linguistic coverage in vision-language systems.

Abstract

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span ~5B people (65% of the world population). Our approach uses a semi-automated translation pipeline, built on a fine-tuned Large Language Model, to adapt the multimodal instruction dataset from English to the target languages, ensuring high linguistic fidelity while remaining scalable because it requires minimal manual effort. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark, which forthcoming approaches can use to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.

Paper Structure (15 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Palo vs. English-VLMs. The plot compares Palo with corresponding Vision-Language Models (VLMs) across 10 different languages. These languages include English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, collectively covering approximately 5B people and 65% of the global population. English-trained VLMs, such as LLaVA and MobileVLM, exhibit poor performance on low-resource languages including Hindi, Arabic, Bengali, and Urdu, due to the under-representation of these languages during their training phases. Palo, in contrast, is a unified model that can hold conversations simultaneously in all the ten languages, demonstrating consistent performance across the board.
  • Figure 2: Architecture overview of Palo. (left) The model consists of a vision encoder that encodes the image, followed by a projector that projects the vision features into the input embedding space of the language model. The user's text query is tokenized, and the tokens are concatenated with the vision tokens before being input into the causal language model to generate the response. For the Palo 7B and 13B variants, Vicuna is used as the Large Language Model, while MobileLLaMA (Chu et al., 2023) is used as the Small Language Model in our MobilePalo-1.7B variant. CLIP ViT-L/336px is used as the vision encoder in all variants. (right) Projectors used in different variants of Palo are shown. For the Palo 7B and 13B, following Liu et al. (2023), we use a two-layer MLP projector with GELU activation. For our mobile version of Palo (MobilePalo-1.7B), we use a Lightweight Downsample Projector (LDP) from Chu et al. (2023). It utilizes depth-wise separable convolutions to downsample the image tokens, making it faster than a standard MLP projector.
  • Figure 3: Qualitative results showing the impact of fine-tuning. Comparative visualization of English to Arabic translations before and after fine-tuning the LLM. The figure shows improvements in language-specific issues such as accurate vocabulary usage, gender agreement, and grammatical correctness, highlighting the enhanced performance of the fine-tuned model.
  • Figure 4: Qualitative results demonstrating the multilingual capabilities of Palo. When presented with user queries, the model generates accurate textual responses related to the visual content and the relevant language. The figure highlights its ability to bridge vision and language understanding across diverse languages. In this illustration, we explore dialogues in two high-resource languages—Spanish and Chinese—and two low-resource languages—Hindi and Arabic. Palo accurately interprets the unusual aspects of an image featuring two individuals in medieval attire within a contemporary supermarket setting. The model exhibits its creative imagination in Chinese, proposing a backstory where these characters might be a king and queen from a storybook. In Hindi, Palo demonstrates scenario-building by describing a possible situation that brought the medieval couple into the current day as time travellers. At the bottom, Palo displays a touch of humour in Arabic, conjuring up a playful dialogue that a king might say, showcasing its subtle understanding of context and culture-specific humour. This image effectively visualizes the advanced ability to process and generate content in multiple languages, reflecting high linguistic precision and cultural intelligence.
  • Figure 5: Qualitative results demonstrating the visual reasoning of Palo and its adeptness in multiple languages. Palo responds accurately to visual content in a contextually appropriate manner for each language. We illustrate a conversation in three high-resource languages (French, Russian, and Japanese) and one low-resource language (Urdu). In the French segment, the model shows practical reasoning by suggesting a recipe that utilizes the available ingredients in the fridge, connecting visual perception to culinary suggestions. In Russian, Palo identifies items rich in Vitamin C, and in the Urdu example, the model organizes the fridge contents into food groups, demonstrating its ability to classify items and apply nutritional knowledge. This highlights its ability to switch between languages while maintaining the context of the conversation, reflecting its capacity to generate relevant and culturally aware content in both high-resource and low-resource languages.
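
The data flow described in the Figure 2 caption (vision features, a two-layer MLP projector with GELU, then concatenation with the text tokens) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the class and function names, the random weights, the tanh approximation of GELU, the toy dimensions, and the choice to place vision tokens before the text tokens are all assumptions made here for clarity.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (an assumption; the exact variant is not stated in the caption).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPProjector:
    """Hypothetical two-layer MLP with GELU that maps vision features to the LLM embedding width."""

    def __init__(self, vision_dim, llm_dim, rng):
        # Randomly initialised weights stand in for trained parameters.
        self.w1 = rng.normal(0.0, 0.02, size=(vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, vision_feats):
        hidden = gelu(vision_feats @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


def build_llm_inputs(vision_feats, text_embeds, projector):
    # Project the image patch features, then concatenate them with the text token embeddings.
    # Placing vision tokens first is an assumption; the caption only says they are concatenated.
    vision_tokens = projector(vision_feats)
    return np.concatenate([vision_tokens, text_embeds], axis=0)


# Toy sizes for illustration only; real encoders and language models are far wider.
NUM_PATCHES, VISION_DIM, LLM_DIM, NUM_TEXT_TOKENS = 16, 32, 64, 5

rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(NUM_PATCHES, VISION_DIM))    # stand-in for vision encoder output
text_embeds = rng.normal(size=(NUM_TEXT_TOKENS, LLM_DIM))    # stand-in for embedded query tokens

projector = MLPProjector(VISION_DIM, LLM_DIM, rng)
llm_inputs = build_llm_inputs(vision_feats, text_embeds, projector)
print(llm_inputs.shape)  # (21, 64): 16 projected vision tokens followed by 5 text tokens
```

The resulting sequence is what a causal language model would consume to generate the response. The mobile variant's LDP would replace `MLPProjector` with a module that also reduces the number of vision tokens, which is not shown here.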