Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet?

Evgeniia Razumovskaia, Ivan Vulić, Anna Korhonen

TL;DR

The paper systematically compares SFT, SIT, and ICL for few-shot multilingual NLU across high- and low-resource languages and tasks (intent detection, value extraction, and NLI), evaluating both performance and practical costs. It finds that supervised approaches (SFT/SIT) generally outperform ICL in accuracy while also being more cost-efficient, with SIT offering the best overall trade-off. The work further analyzes target-language adaptation of LLMs, showing generation fluency improves after adaptation but actual NLU gains remain limited, especially for low-resource languages. Overall, the study highlights the critical role of multilingual pretraining and instruction-tuning while underscoring the need for improved language-adaptation strategies and more multilingual, multitask pretraining to bridge the gap beyond English-centric models.

Abstract

Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning. ICL has gained popularity recently with the advent of LLMs due to its simplicity and sample efficiency. Prior research has conducted only limited investigation into how these approaches work for multilingual few-shot learning, and the focus so far has been mostly on their performance. In this work, we present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups. Importantly, performance is only one aspect of the comparison, where we also analyse the approaches through the optics of their computational, inference and financial costs. Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements. As another contribution, we analyse the impact of target language adaptation of pretrained LLMs and find that the standard adaptation approaches can (superficially) improve target language generation capabilities, but language understanding elicited through ICL does not improve and remains limited, with low scores especially for low-resource languages.
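
To make the distinction between the three approaches concrete, the following is a minimal sketch of how the same few-shot intent detection examples might be presented under each paradigm. The instruction wording, field names, utterances and intent labels are all illustrative assumptions; the prompts and data formats actually used in the paper are not shown in this section.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    text: str
    intent: str


# Hypothetical instruction wording, chosen only for illustration.
INSTRUCTION = "Classify the intent of the user utterance."


def sft_pair(ex: Example) -> Tuple[str, str]:
    """SFT: raw text paired with a label; a task-specific head is trained on such pairs."""
    return ex.text, ex.intent


def sit_pair(ex: Example) -> Tuple[str, str]:
    """SIT: the task is cast as text-to-text; the model is trained to generate the label."""
    source = f"{INSTRUCTION}\nUtterance: {ex.text}\nIntent:"
    return source, ex.intent


def icl_prompt(demos: List[Example], query: str) -> str:
    """ICL: demonstrations are placed in the prompt; no model parameters are updated."""
    blocks = [f"Utterance: {d.text}\nIntent: {d.intent}" for d in demos]
    blocks.append(f"Utterance: {query}\nIntent:")
    return INSTRUCTION + "\n\n" + "\n\n".join(blocks)


# Made-up few-shot examples (not taken from any dataset).
demos = [
    Example("book a table for two", "make_reservation"),
    Example("cancel my booking", "cancel_reservation"),
]
prompt = icl_prompt(demos, "I want to change my booking")
```

In this sketch the ICL prompt grows with the number of demonstrations, whereas the SFT and SIT inputs at inference time contain only the query. This is one plausible reason for the higher inference cost attributed to ICL above.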

Paper Structure (20 sections, 4 figures, 14 tables)

Figures (4)

  • Figure 1: Comparison of practical aspects of different learning paradigms in the intent detection task from Multi3NLU++ (Moghe et al., 2023), with exactly the same data setup, for Amharic and Spanish. In-context learning (ICL) has low performance and high inference and computational costs, while being comparatively inexpensive financially. Supervised fine-tuning (SFT) and supervised instruction tuning (SIT), on the other hand, have a larger financial cost, but they are much more efficient in terms of inference and computational resources while also performing much better, both for Amharic as a representative low-resource language and for Spanish as a high-resource language.
  • Figure 2: Intent detection, value extraction (VE) and NLI results for the six languages in our evaluation. This performance is in line with prior work (Hu et al., 2023). We exclude LLaMA-2 results, as its performance was 0.0 across all tasks. Results for VE in other setups are provided in the appendix.
  • Figure 3: Generation evaluation after target language adaptation (LLaMA-2).
  • Figure 4: Value extraction (VE) results for Amharic (am), English (en), Marathi (mr), Spanish (es) and Turkish (tr) in two setups: a) cross-domain in-language; and b) cross-lingual in-domain. We exclude ICL-mT0 XL from the plot, as it scored 0.0 on the VE task in these setups. Qualitative analysis of the ICL-mT0 outputs showed that they neither adhered to the slot-value pair formatting nor included the right values.