Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune

TL;DR

This work addresses the scarcity of open Multimodal Large Language Models (MLLMs) for low-resource languages by building Basque-focused datasets and evaluating two backbones, Llama-3.1-Instruct (English-centric) and Latxa (Basque-adapted), within a late-fusion MLLM framework. It shows that a training mixture with roughly 20% Basque multimodal data is enough to achieve strong results on Basque benchmarks, and that an English-centric backbone can match a Basque-adapted one given appropriate data mixtures and text-only instructions. The authors release their datasets and outline a pathway for developing open MLLMs for other low-resource languages. They also highlight the importance of human evaluation for open-ended generation and discuss limitations related to translation and data availability. Overall, the study presents cross-lingual transfer and data-efficient training as practical routes to broaden MLLM access in under-resourced linguistic contexts.

Abstract

Current Multimodal Large Language Models (MLLMs) exhibit very strong performance on several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. By openly releasing our resources, we pave the way for developing MLLMs for other low-resource languages.
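
To make the notion of a data mixture concrete, the following is a minimal sketch of how a training set with a fixed share of Basque examples could be assembled. It is not the authors' pipeline: the function name `build_mixture`, the `lang` field, and the sampling strategy (uniform sampling without replacement from two pools) are assumptions made purely for illustration.

```python
import random


def build_mixture(basque_pool, english_pool, basque_ratio, total, seed=0):
    """Sample `total` examples so that about `basque_ratio` of them are Basque.

    Hypothetical helper: the pools are plain lists of examples, and the
    remaining (1 - basque_ratio) share is filled from the English pool.
    """
    if not 0.0 <= basque_ratio <= 1.0:
        raise ValueError("basque_ratio must be in [0, 1]")

    n_basque = round(total * basque_ratio)
    n_english = total - n_basque
    if n_basque > len(basque_pool) or n_english > len(english_pool):
        raise ValueError("a pool is too small for the requested mixture")

    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixture = rng.sample(basque_pool, n_basque) + rng.sample(english_pool, n_english)
    rng.shuffle(mixture)  # interleave languages instead of training in blocks
    return mixture


if __name__ == "__main__":
    # Toy pools standing in for multimodal instruction data.
    basque = [{"lang": "eu", "id": i} for i in range(50)]
    english = [{"lang": "en", "id": i} for i in range(200)]
    mix = build_mixture(basque, english, basque_ratio=0.2, total=100)
    share = sum(ex["lang"] == "eu" for ex in mix) / len(mix)
    print(f"Basque share: {share:.0%}")
```

Sweeping `basque_ratio` over several values would give one training set per setting, which is the kind of comparison the abstract describes at the level of results.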

Paper Structure

This paper contains 31 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The late-fusion MLLM architecture used in this work. Our MLLMs have a visual encoder to represent input images, a connector to project visual representations into the embedding space of the LLM, and an LLM to process image inputs and textual queries to generate textual answers.
  • Figure 2: Performance across multimodal benchmarks of Latxa-based (top row) and Llama-based (bottom row) MLLMs trained with different percentages of Basque Multimodal Instruction Data. The models are evaluated on the English (original) and Basque (translated) versions of close-ended benchmarks.
  • Figure 3: Performance across Basque multimodal (left) and text-only (right) benchmarks of Latxa-based (top row) and Llama-based (bottom row) MLLMs trained with different percentages of Basque Multimodal Instruction Data. The models are evaluated on the English (original) and Basque (translated) versions.
  • Figure 4: Two-shot prompting procedure for English to Basque translation of the A-OKVQAEus benchmark. example['question'] and example['answer'] correspond to the question and answer to be translated in the example.
  • Figure 5: Two-shot prompting procedure for English to Basque translation of the PixMo-CapQAEus benchmark. example['question'] and example['answer'] correspond to the question and answer to be translated in the example.
  • ...and 2 more figures
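
The captions of Figures 4 and 5 refer to a two-shot prompt that fills in `example['question']` and `example['answer']`. A minimal sketch of such a prompt builder is shown below. The exact prompt wording and the demonstrations used in the paper are not given here, so the instruction text, the field labels, the `question_eu`/`answer_eu` keys, and the two demonstration pairs are all hypothetical.

```python
def build_two_shot_prompt(example, demonstrations):
    """Build an English-to-Basque translation prompt with two demonstrations.

    Hypothetical format: each demonstration is a dict holding an English
    question/answer pair and its Basque translation.
    """
    if len(demonstrations) != 2:
        raise ValueError("a two-shot prompt needs exactly two demonstrations")

    # Assumed instruction text, not the wording used by the authors.
    parts = ["Translate the following question and answer from English to Basque."]

    for demo in demonstrations:
        parts.append(
            f"English question: {demo['question']}\n"
            f"English answer: {demo['answer']}\n"
            f"Basque question: {demo['question_eu']}\n"
            f"Basque answer: {demo['answer_eu']}"
        )

    # The item to translate ends with an open slot for the model to complete.
    parts.append(
        f"English question: {example['question']}\n"
        f"English answer: {example['answer']}\n"
        "Basque question:"
    )
    return "\n\n".join(parts)


# Illustrative demonstrations only; they are not taken from the paper.
DEMOS = [
    {"question": "What color is the car?", "answer": "Red",
     "question_eu": "Zer koloretakoa da autoa?", "answer_eu": "Gorria"},
    {"question": "How many dogs are there?", "answer": "Two",
     "question_eu": "Zenbat txakur daude?", "answer_eu": "Bi"},
]

if __name__ == "__main__":
    example = {"question": "Where is the cat?", "answer": "On the table"}
    print(build_two_shot_prompt(example, DEMOS))
```

Leaving the final `Basque question:` slot open lets the model continue the demonstrated pattern, so the translated question and answer can be parsed from its output by the same labels.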