Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA

Sandipan Majhi, Paheli Bhattacharya

TL;DR

This work tackles domain-specific QA in a low-resource language (Hindi) by augmenting the training data of a small language model with synthetic QA pairs generated by larger LLMs. It adopts a multi-stage finetuning strategy and compares three regimes (Baseline, Continued Finetuning, and Multi-Source Finetuning) on the VATIKA Hindi tourism dataset, including a held-out Test Data-2 to measure robustness. The results show that synthetic data can boost performance and robustness for lightweight models: late exposure to synthetic data (Continued Finetuning) often yields the best out-of-sample results, while indiscriminate mixing of sources can degrade certain metrics. The study provides practical insights for scalable domain adaptation in low-resource settings and highlights the need for synthetic-data quality control, potentially via LLM-based filtering, before the approach is extended to other languages and domains.

Abstract

Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large language models (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.
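
The three regimes named in the TL;DR differ mainly in when the small model sees synthetic data. The sketch below is one possible reading of that description, not the authors' implementation: it assumes Baseline uses only the original pairs, Continued Finetuning trains on the original pairs first and on the synthetic pairs in a later stage, and Multi-Source Finetuning shuffles both sources into a single stage. The function name `build_schedules`, the `QAPair` type, and the stage layout are hypothetical and introduced only for illustration.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical representation of one example: (question, answer).
QAPair = Tuple[str, str]


def build_schedules(
    original: List[QAPair],
    synthetic: List[QAPair],
    seed: int = 0,
) -> Dict[str, List[List[QAPair]]]:
    """Return, for each regime, the ordered list of training stages.

    Each stage is the list of QA pairs the model would be finetuned on
    before moving to the next stage. The mapping of regime names to
    stages is an assumption based on the TL;DR wording.
    """
    rng = random.Random(seed)

    # Multi-source: both sources pooled and shuffled into one stage.
    mixed = list(original) + list(synthetic)
    rng.shuffle(mixed)

    return {
        # Original data only, single stage.
        "baseline": [list(original)],
        # Original data first, synthetic data in a later stage.
        "continued": [list(original), list(synthetic)],
        # Everything at once, with no ordering between sources.
        "multi_source": [mixed],
    }


if __name__ == "__main__":
    # Placeholder pairs; real data would be Hindi tourism QA pairs.
    original = [("orig-q1", "orig-a1"), ("orig-q2", "orig-a2")]
    synthetic = [("syn-q1", "syn-a1"), ("syn-q2", "syn-a2"), ("syn-q3", "syn-a3")]

    for name, stages in build_schedules(original, synthetic).items():
        print(name, [len(stage) for stage in stages])
```

Returning explicit stages rather than a single merged list keeps the ordering decision visible, which is the variable the comparison between Continued and Multi-Source Finetuning appears to turn on.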

Paper Structure

This paper contains 7 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: An overview of the two-phased experimental procedure, including synthetic data generation, followed by mixed fine-tuning of a smaller language model on the augmented Hindi-language dataset.