Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks

Munief Hassan Tahir, Sana Shams, Layba Fiaz, Farah Adeeba, Sarmad Hussain

TL;DR

This work benchmarks seven pre-trained LLMs across 17 Urdu NLP tasks using 22 public datasets and zero-shot prompts to compare against SOTA baselines. It demonstrates that SOTA models typically outperform encoder-decoder LLMs in Urdu tasks, but newer models with richer Urdu data coverage, such as Llama 3.1, can surpass prior SOTA on several tasks. GPT-3.5-turbo remains competitive, while Ministral 8B shows strength in detection-oriented tasks, highlighting the impact of prompt design and post-processing. The findings offer practical guidance for Urdu NLP practitioners on model selection, prompting, and the value of language diversity in pretraining.

Abstract

Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research by shifting from language- and task-specific model pipelines to a single model adapted to a variety of tasks. However, the majority of existing multilingual NLP benchmarks for LLMs provide evaluation data in only a few languages with little linguistic diversity. In addition, these benchmarks lack quality assessment against the respective state-of-the-art models. This study presents an in-depth examination of 7 prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B, and Whisper (large, medium, and small variants), across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, and compares and analyzes their performance against state-of-the-art (SOTA) models. Our experiments show that SOTA models currently outperform encoder-decoder models in the majority of Urdu NLP tasks under zero-shot settings. However, comparing Llama 3.1-8B with its predecessor Llama 2-7B-Chat, we can deduce that with improved language coverage, LLMs can surpass these SOTA models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.

Paper Structure (140 sections, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Performance of the different models in the zero-shot scenario compared with SOTA. A missing bar for a task means that the model cannot perform that task.