Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models

Praveen Gatla; Anushka; Nikita Kanwar; Gouri Sahoo; Rajesh Kumar Mundotiya

TL;DR

The study builds a baseline extractive QA system for Hindi tourism in Varanasi by creating a large, domain-spanning Hindi QA dataset and evaluating BERT/RoBERTa models with supervised fine-tuning and LoRA-based parameter-efficient tuning. It demonstrates that Hindi-specific pretraining (HindiBERT, HindiRoBERTa) generally outperforms multilingual bases, with RoBERTa + SFT delivering strong domain performance and LoRA offering substantial parameter reduction while maintaining competitive results in several subdomains. The work provides a foundational Hindi tourism QA benchmark and insights into model selection for low-resource, culturally nuanced domains, highlighting the balance between accuracy and efficiency. It also points to future integration with retrieval-augmented generation (RAG) and expansion to additional Indian-language domains to enhance accessibility for visitors and researchers alike.
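The parameter reduction attributed to LoRA can be illustrated with simple arithmetic on a single weight matrix. The sketch below is not the study's configuration: the hidden size (768) and the rank (8) are assumed values chosen only for illustration, and the function names are hypothetical. It compares the parameters updated by full fine-tuning of one square projection matrix with the parameters in the two low-rank factors that LoRA trains in its place.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed, illustrative values; the study's actual settings are not stated here.
hidden = 768  # hidden size typical of a BERT-base-sized encoder
rank = 8      # hypothetical LoRA rank

full = full_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
reduction = 1 - lora / full

print(f"full={full}, lora={lora}, reduction={reduction:.1%}")
# prints: full=589824, lora=12288, reduction=97.9%
```

This is a per-matrix figure under the assumed sizes. The reduction over a whole model depends on which layers receive adapters and on which other parameters remain trainable.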

Abstract

This article presents the first comprehensive study on designing a baseline extractive question-answering (QA) system for the Hindi tourism domain, with a specialized focus on Varanasi, a cultural and spiritual hub renowned for its Bhakti-Bhaav (devotional ethos). Targeting ten tourism-centric subdomains (Ganga Aarti, Cruise, Food Court, Public Toilet, Kund, Museum, General, Ashram, Temple, and Travel), the work addresses the absence of language-specific QA resources in Hindi for culturally nuanced applications. A dataset comprising 7,715 Hindi QA pairs pertaining to Varanasi tourism was constructed and subsequently augmented with 27,455 pairs generated via Llama zero-shot prompting. We propose a framework leveraging the foundation models BERT and RoBERTa, fine-tuned using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA), to optimize parameter efficiency and task performance. Multiple variants of BERT, including language-specific pre-trained models (e.g., Hindi-BERT), are evaluated to assess their suitability for low-resource, domain-specific QA. The evaluation metrics (F1, BLEU, and ROUGE-L) highlight trade-offs between answer precision and linguistic fluency. Experiments demonstrate that LoRA-based fine-tuning achieves competitive performance (85.3% F1) while reducing trainable parameters by 98% compared to SFT, striking a balance between efficiency and accuracy. Comparative analysis across models reveals that RoBERTa with SFT outperforms BERT variants in capturing contextual nuances, particularly for culturally embedded terms (e.g., Aarti, Kund). This work establishes a foundational baseline for Hindi tourism QA systems, emphasizing the role of LoRA in low-resource settings and underscoring the need for culturally contextualized NLP frameworks in the tourism domain.
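For readers unfamiliar with how F1 is usually computed for extractive QA, the following minimal sketch shows the common token-overlap formulation. It assumes whitespace tokenization and the SQuAD-style definition; the abstract does not state which F1 variant the study uses, so this is an illustration and not the authors' evaluation script. The function name `token_f1` and the Hindi strings are hypothetical examples, not items from the dataset.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the reference is "Dashashwamedh Ghat" and the
# prediction adds one extra token ("at Dashashwamedh Ghat").
reference = "दशाश्वमेध घाट"
prediction = "दशाश्वमेध घाट पर"
score = token_f1(prediction, reference)
print(round(score, 2))  # precision 2/3, recall 1 -> 0.8
```

A prediction that includes the full reference plus extra tokens keeps perfect recall but loses precision, which is why F1 rewards tight answer spans.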
