Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis
Md. Arid Hasan, Shudipta Das, Afiyat Anjum, Firoj Alam, Anika Anjum, Avijit Sarker, Sheak Rashed Haider Noori
TL;DR
This study tackles Bangla sentiment analysis in a low-resource setting by introducing MUBASE, a large manually annotated dataset of 33,606 Bangla social-media posts, and by benchmarking zero- and few-shot prompting of LLMs (GPT-4, BLOOMZ, Flan-T5) against fine-tuned models. It systematically compares baselines, classical models, small language models (e.g., BanglaBERT), GPT embeddings, and several LLM prompting strategies on a common data split and set of evaluation metrics. The findings show that fine-tuned monolingual models consistently outperform zero- and few-shot LLM prompting, though LLMs can be competitive and are valuable when data or resources for fine-tuning are limited; ensemble techniques offer notable gains. The work provides a publicly available resource and a clear benchmark for future Bangla NLP research, highlighting the continued value of domain-specific fine-tuning and the potential of native-language prompts for in-language sentiment tasks.
Abstract
The rapid expansion of the digital world has propelled sentiment analysis into a critical tool across diverse sectors such as marketing, politics, customer service, and healthcare. While there have been significant advancements in sentiment analysis for widely spoken languages, low-resource languages, such as Bangla, remain largely under-researched due to resource constraints. Furthermore, the recent unprecedented performance of Large Language Models (LLMs) in various applications highlights the need to evaluate them in the context of low-resource languages. In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments. We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and BLOOMZ, offering a comparative analysis against fine-tuned models. Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero- and few-shot scenarios. To foster continued exploration, we intend to make this dataset and our research tools publicly available to the broader research community.
