Chitrarth: Bridging Vision and Language for a Billion People

Shaharukh Khan, Ayush Tarun, Abhinav Ravi, Ali Faraz, Akshat Patidar, Praveen Kumar Pokala, Anagha Bhangare, Raja Kolla, Chandra Khatri, Shubham Agarwal

TL;DR

Chitrarth addresses the language diversity gap in vision-language models by integrating a multilingual LLM backbone (Krutrim) with a vision encoder to support 10 Indic languages. It uses a two-stage training pipeline to enable cross-language, multimodal conversations: Stage 1 aligns visual and language representations using translated image-caption data, and Stage 2 applies multilingual instruction tuning on diverse data. The authors also present BharatBench, a benchmark suite that extends multimodal evaluation to low-resource Indian languages, and report SOTA results on low-resource-language benchmarks while remaining competitive on English ones. The work emphasizes data-centric multilingual alignment, provides open evaluation resources, and highlights practical implications for deploying inclusive AI to a population of over a billion people. Limitations include translation-induced biases; future work aims to unfreeze the vision encoder, use higher-resolution vision models, and broaden language coverage.

Abstract

Recent multimodal foundation models are primarily trained on English or high-resource European language data, which hinders their applicability to other medium- and low-resource languages. To address this limitation, we introduce Chitrarth (Chitra: Image; Artha: Meaning), an inclusive Vision-Language Model (VLM), specifically targeting the rich linguistic diversity and visual reasoning across 10 prominent Indian languages. Our model effectively integrates a state-of-the-art (SOTA) multilingual Large Language Model (LLM) with a vision module, primarily trained on multilingual image-text data. We also introduce BharatBench, a comprehensive framework for evaluating VLMs across various Indian languages, ultimately contributing to more diverse and effective AI systems. Our model achieves SOTA results on benchmarks across low-resource languages while retaining its efficiency in English. Through our research, we aim to set new benchmarks in multilingual-multimodal capabilities, offering substantial improvements over existing models and establishing a foundation to facilitate future advancements in this arena.

Paper Structure

This paper contains 12 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Multilingual capability of the Chitrarth model across major Indian languages. For the same underlying image, we present question-answer pairs in English and several Indian languages - Gujarati, Kannada, Hindi, Marathi, Telugu, Tamil, Malayalam, and Bengali (in order). Questions are highlighted in purple, and responses are shown in orange (provided with English translations). The model accurately understands and identifies the 'image of a saint writing a book with a feather' and correctly addresses related questions in different languages.
  • Figure 2: The Chitrarth model features a fully autoregressive architecture with a two-stage training process. In Stage 1, the model is trained using images and their descriptions, aligning visual and linguistic embeddings through image-caption pairs. In Stage 2, the model is fine-tuned on multimodal instruction-following and domain-specific academic datasets.
  • Figure 3: Language distribution in data mix. (a) Stage 1 data consists of 1.2M ShareGPT4V in the original English version (650K) and remaining Indian language translations (65K each). (b) Stage 2 data involves 879K samples in English and 88K for each respective language, discussed in the dataset section.
  • Figure 4: Multilingual VLM Capabilities. Our model demonstrates robust performance across various languages in: a) Creative writing, b) Fine-grained attribute extraction, c) Explaining scientific diagrams, d) Screen reading/OCR, e) Anomaly and hazard detection, and f) Real-time accident and incident monitoring.
  • Figure 5: Performance against SOTA VLMs on different academic multimodal tasks. Our model consistently outperforms IDEFICS 2 (7B) and PALO 7B on different benchmarks while remaining competitive on TextVQA and VizWiz.
  • ...and 3 more figures
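
As a rough illustration of the Stage 2 data mix described in the Figure 3 caption, the sketch below computes each language's share of the training samples. It assumes ten non-English languages with 88K samples each (taken from "10 Indic languages" and "88K for each respective language"). The function name `language_shares` and the `indic_1` ... `indic_10` labels are hypothetical placeholders, not names from the paper.

```python
from typing import Dict, List


def language_shares(english: int, per_language: int,
                    languages: List[str]) -> Dict[str, float]:
    """Return each language's fraction of the total sample count."""
    counts = {"English": english}
    # Every non-English language is assumed to contribute the same count.
    counts.update({lang: per_language for lang in languages})
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Placeholder labels for the ten Indic languages (assumption: exactly ten,
# each with the same sample count, as the caption suggests).
INDIC = [f"indic_{i}" for i in range(1, 11)]

# Stage 2 counts quoted in the Figure 3 caption: 879K English, 88K per language.
stage2 = language_shares(879_000, 88_000, INDIC)

print(f"English share: {stage2['English']:.3f}")
print(f"Share per Indic language: {stage2['indic_1']:.3f}")
```

Under these assumptions English makes up roughly half of the Stage 2 mix, with the remainder split evenly across the other languages.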