Multilingual State Space Models for Structured Question Answering in Indic Languages

Arpita Vats, Rahul Raja, Mrinal Mathur, Vinija Jain, Aman Chadha

TL;DR

This work addresses QA in Indic languages by applying State Space Models (SSMs) to capture long- and short-range dependencies with linear-time processing, offering a scalable alternative to Transformers for long-context QA. It implements a comprehensive pipeline (tokenization with IndicNLP, SSM-based models, and LoRA fine-tuning) and evaluates multiple SSM variants on Hindi and Marathi using a SQuAD-style IndicQA dataset, establishing a first benchmark in this domain. Key findings show that fine-tuning, particularly with Mamba-2, yields strong improvements in span localization and semantic alignment, narrowing the gap between languages and demonstrating the viability of low-resource multilingual QA with SSMs. The work implies significant practical impact for deploying efficient, multilingual QA systems in resource-constrained settings, and lays groundwork for expanding to more dialects and unified multilingual models, with attention to data diversity and tokenization fidelity. $N$-dimensional state dynamics and $O(n)$ sequence processing underpin the efficiency advantages of SSMs over Transformer-based approaches in this context.
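The linear-time claim above can be illustrated with a minimal sketch of a discrete state-space recurrence, $h_t = A h_{t-1} + B x_t$, $y_t = C h_t$. This is a plain time-invariant recurrence written for illustration only: it is not the paper's implementation, and it omits the input-dependent (selective) parameters that Mamba-style models use. The function name `ssm_scan` and the toy matrices are assumptions, not taken from the source.

```python
from typing import List


def ssm_scan(
    A: List[List[float]],
    B: List[float],
    C: List[float],
    inputs: List[float],
) -> List[float]:
    """Run a toy linear SSM over a scalar input sequence.

    State update:  h_t = A @ h_{t-1} + B * x_t
    Readout:       y_t = C . h_t

    Each step costs O(N^2) for an N-dimensional state, independent of the
    sequence length, so the whole scan is linear in len(inputs).
    """
    n = len(B)
    h = [0.0] * n  # hidden state starts at zero (an assumption of this sketch)
    outputs = []
    for x in inputs:
        # One recurrent step: mix the previous state, then inject the input.
        h = [sum(A[i][j] * h[j] for j in range(n)) + B[i] * x for i in range(n)]
        outputs.append(sum(C[i] * h[i] for i in range(n)))
    return outputs


# Hypothetical 2-dimensional state with two decay rates (illustrative values).
A = [[0.5, 0.0], [0.0, 0.25]]
B = [1.0, 1.0]
C = [1.0, 1.0]
print(ssm_scan(A, B, C, [1.0, 0.0, 0.0]))
```

The loop carries only a fixed-size state from one token to the next, which is where the contrast with attention comes from: self-attention compares every token with every other token, whereas this scan touches each input once.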

Abstract

The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA). To address these challenges, this paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.

Paper Structure

This paper contains 22 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Workflow of the fine-tuned SSM model for multilingual question-answering tasks, illustrating the complete framework.
  • Figure 2: Illustration of the data preprocessing pipeline.
  • Figure 3: Examples of structured question-answer pairs in Marathi used for training or evaluating the QA system.
  • Figure 4: Validation plots for SSM models.
  • Figure 5: Question Length vs Answer Length. This plot compares the lengths of questions and answers in the Hindi (blue) and Marathi (red) datasets, highlighting a correlation between them.
  • ...and 4 more figures