Generalists vs. Specialists: Evaluating Large Language Models for Urdu

Samee Arif, Abdul Hameed Azeemi, Agha Ali Raza, Awais Athar

TL;DR

The paper evaluates generalist versus specialist large language models on 14 Urdu tasks (7 classification, 7 generation), finding that task-specific fine-tuning substantially boosts performance in Urdu NLP. Across both evaluation modalities, specialists (XLM-R, mT5, Llama-FT) generally outperform generalists, though human judgments sometimes favor generalist GPT-4-Turbo for generation tasks, revealing gaps between automated metrics and human perception. The work highlights the critical role of native Urdu data, careful prompt design, and the limitations of current Urdu datasets, arguing for broader native-resource development and more nuanced evaluation frameworks. Overall, the study provides practical guidance for deploying Urdu NLP systems and underscores the value of combining quantitative metrics with human evaluation in low-resource languages.

Abstract

In this paper, we compare general-purpose models, GPT-4-Turbo and Llama-3-8b, with special-purpose models--XLM-Roberta-large, mT5-large, and Llama-3-8b--that have been fine-tuned on specific tasks. We focus on seven classification and seven generation tasks to evaluate the performance of these models on the Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite frequent advancements in Large Language Models (LLMs), their performance in low-resource languages, including Urdu, still needs to be explored. We also conduct a human evaluation for the generation tasks and compare the results with the evaluations performed by GPT-4-Turbo, Llama-3-8b and Claude 3.5 Sonnet. We find that special-purpose models consistently outperform general-purpose models across various tasks. We also find that the evaluation done by GPT-4-Turbo for generation tasks aligns more closely with human evaluation than the evaluation done by Llama-3-8b. This paper contributes to the NLP community by providing insights into the effectiveness of general-purpose and special-purpose LLMs for low-resource languages.

Paper Structure (31 sections, 12 figures, 9 tables)

Figures (12)

  • Figure 1: PoS Data Structure for Llama
  • Figure 2: CoT example from abuse detection
  • Figure 3: Classification Prompt Template
  • Figure 4: Classification Prompt Example
  • Figure 5: Classification Prompt Template (CoT)
  • ...and 7 more figures