Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation

Shaharukh Khan, Ayush Tarun, Ali Faraz, Palash Kamble, Vivek Dahiya, Praveen Pokala, Ashish Kulkarni, Chandra Khatri, Abhinav Ravi, Shubham Agarwal

TL;DR

Chitranuvad presents a unified, multilingual multimodal translation system that grounds English-to-Indic translations in visual context by fusing a ViT-based image encoder with a pre-trained multilingual LLM backbone. The model employs a three-stage training pipeline—feature alignment, instruction tuning, and task-specific fine-tuning—with experiments exploring single- and multi-layer modality projections and multiple vision encoders, achieving state-of-the-art results for Hindi and competitive performance for Malayalam and Bengali. Data augmentation spans translated image-text corpora and Visual Genome alignments, enabling robust grounded translation across three Indic languages. Despite modest observed gains from the vision stream in some settings, the approach demonstrates strong cross-track performance and highlights the benefits of multilingual pretraining for zero-shot translation in multimodal contexts.

Abstract

In this work, we provide the system description of our submission to the English to Lowres Multimodal Translation Task at the Workshop on Asian Translation (WAT2024). We introduce Chitranuvad, a multimodal model that effectively integrates a multilingual LLM and a vision module for multimodal translation. Our method uses a ViT image encoder to extract visual representations as visual token embeddings, which an adapter layer projects into the LLM embedding space; the LLM then generates the translation autoregressively. We participated in all three tracks (Image Captioning, Text-only Translation, and Multimodal Translation) for Indic languages (i.e., English to Hindi, Bengali, and Malayalam) and achieved SOTA results for Hindi in all of them on the Challenge set, while remaining competitive for the other languages in the shared task.
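
The data flow described in the abstract (ViT patch features, an adapter that projects them into the LLM embedding space, then a combined sequence for autoregressive decoding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the single linear layer, the random weights, the function names, and the choice to place visual tokens before the text tokens are all assumptions made for illustration.

```python
import numpy as np


def project_visual_tokens(patch_features, weight, bias):
    """Linear adapter (assumed form): map ViT patch features of shape
    (n_patches, d_vision) into the LLM embedding space (n_patches, d_llm)."""
    return patch_features @ weight + bias


def build_llm_input(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with text token embeddings.
    Placing the visual tokens first is an assumption of this sketch."""
    return np.concatenate([visual_tokens, text_embeddings], axis=0)


# Hypothetical sizes, chosen small for readability.
d_vision, d_llm = 8, 16
n_patches, n_text = 4, 5

rng = np.random.default_rng(0)
patch_features = rng.normal(size=(n_patches, d_vision))   # stand-in for ViT output
text_embeddings = rng.normal(size=(n_text, d_llm))        # stand-in for LLM token embeddings
weight = rng.normal(size=(d_vision, d_llm)) * 0.02        # adapter weights (random here)
bias = np.zeros(d_llm)

visual_tokens = project_visual_tokens(patch_features, weight, bias)
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(llm_input.shape)  # (9, 16): 4 visual tokens followed by 5 text tokens
```

In a real system the combined sequence would be passed to the LLM, which decodes the target-language translation one token at a time; the sketch stops at building that input sequence.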

Paper Structure

This paper contains 10 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Multimodal Machine Translation task as part of English-to-lowres track where the source sentence is translated to multiple Indic languages (Hindi, Bengali, Malayalam) grounded in the image. Meaning of words like "court" and "right" in the translations can vary significantly depending on the visual context.
  • Figure 2: Chitranuvad model architecture with the three-stage training pipeline described in Section \ref{sec:model}.
  • Figure 3: The English-to-lowres Multimodal Machine Translation track supports translation of the source sentence into multiple Indic languages (Hindi, Bengali, Malayalam). We enrich the dataset to include labels of all the identified objects. We show the outputs of our best model, which is trained with a mix of multilingual data in all three stages.