Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks

Dan Saattrup Nielsen, Kenneth Enevoldsen, Peter Schneider-Kamp

TL;DR

The paper investigates whether encoder or decoder language models are better suited for multilingual NLU across Germanic languages by extending ScandEval to include decoder models and additional languages (German, Dutch, English). It reframes NLU tasks as few-shot generative tasks and introduces a robust score-aggregation method, providing public leaderboards. The key finding is that encoder models typically outperform decoder models, even with far fewer parameters, though results vary by language and task; decoders exhibit a strong QA bias and follow different performance trajectories as revealed by UMAP analyses. This work informs model selection for multilingual NLU and supplies a standardized, cross-paradigm evaluation framework with open resources.

Abstract

This paper explores the performance of encoder and decoder language models on multilingual Natural Language Understanding (NLU) tasks, with a broad focus on Germanic languages. Building upon the ScandEval benchmark, initially restricted to evaluating encoder models, we extend the evaluation framework to include decoder models. We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English. Through a series of experiments and analyses, we also address research questions regarding the comparative performance of encoder and decoder models, the impact of NLU task types, and the variation across language resources. Our findings reveal that encoder models can achieve significantly better NLU performance than decoder models despite having orders of magnitude fewer parameters. Additionally, we investigate the correlation between decoders and task performance via a UMAP analysis, shedding light on the unique capabilities of decoder and encoder models. This study contributes to a deeper understanding of language model paradigms in NLU tasks and provides valuable insights for model selection and evaluation in multilingual settings.
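
To make the idea of reframing an NLU task as a few-shot generative task concrete, here is a minimal sketch for a sentiment classification task. It is an illustration under stated assumptions, not the paper's actual prompt format: the function names `build_few_shot_prompt` and `parse_label`, the `Text:`/`Label:` template, the instruction wording, and the example sentences are all hypothetical.

```python
from typing import List, Optional, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    new_text: str,
    instruction: str = "Classify each text as positive, neutral or negative.",
) -> str:
    """Format labelled examples plus one unlabelled text as a single prompt.

    The template is an assumption made for illustration only.
    """
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in examples]
    # The final block is left open so a decoder model completes the label.
    blocks.append(f"Text: {new_text}\nLabel:")
    return instruction + "\n\n" + "\n\n".join(blocks)


def parse_label(generated: str, labels: List[str]) -> Optional[str]:
    """Map free-form model output back to one of the allowed labels.

    Returns None when the output does not start with a known label.
    """
    cleaned = generated.strip().lower()
    for label in labels:
        if cleaned.startswith(label):
            return label
    return None


# Hypothetical few-shot examples; a real setup would sample them from training data.
few_shot = [("Great film", "positive"), ("Awful food", "negative")]
prompt = build_few_shot_prompt(few_shot, "It was fine")
print(prompt)
```

The text generated by a decoder model after the final `Label:` would then be passed through `parse_label`, so that the result can be scored with the same classification metric used for an encoder model's predictions.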

Paper Structure

This paper contains 21 sections, 2 figures, 15 tables, and 1 algorithm.

Figures (2)

  • Figure 1: UMAP plots of the models on the ScandEval leaderboards.
  • Figure 2: The correlation between a model being generative and its performance on the four NLU tasks.