Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish

Fred Philippy, Shohreh Haddadan, Siwen Guo

TL;DR

The paper tackles zero-shot topic classification for low-resource languages, using Luxembourgish as a case study, by replacing NLI-based training data with data derived from a dictionary. It introduces two datasets, LETZ-SYN and LETZ-WoT, built from a public dictionary and used to train models within a simple entailment framework for zero-shot classification. Empirical results show that models trained on the dictionary-based data outperform NLI-based baselines and achieve comparable or better performance with far fewer labeled examples. The authors expect the approach to transfer to other languages for which such a dictionary is available, offering a practical path to topic classification in data-scarce settings.

Abstract

In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without any labeled examples for the target classes. A common method for ZSC is to fine-tune a language model on a Natural Language Inference (NLI) dataset and then use it to infer the entailment between the input document and the target labels. However, this approach faces certain challenges, particularly for languages with limited resources. In this paper, we propose an alternative solution that leverages dictionaries as a source of data for ZSC. We focus on Luxembourgish, a low-resource language spoken in Luxembourg, and construct two new topic relevance classification datasets based on a dictionary that provides various synonyms, word translations and example sentences. We evaluate the usability of our dataset and compare it with the NLI-based approach on two topic classification tasks in a zero-shot manner. Our results show that by using the dictionary-based dataset, the trained models outperform the ones following the NLI-based approach for ZSC. While we focus on a single low-resource language in this study, we believe that the efficacy of our approach can also transfer to other languages where such a dictionary is available.
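
To make the entailment-style inference described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. Each candidate label is turned into a hypothesis sentence, a scoring function rates how strongly the document supports that hypothesis, and the highest-scoring label is returned. The function names, the English hypothesis template, and the toy word-overlap scorer are all assumptions made for illustration. In practice the scorer would be a fine-tuned language model, and the texts would be in the target language.

```python
import re
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical template; a real setup would phrase it in the target language.
DEFAULT_TEMPLATE = "This text is about {}."


def build_hypothesis(label: str, template: str = DEFAULT_TEMPLATE) -> str:
    """Turn a candidate label into a hypothesis sentence."""
    return template.format(label)


def zero_shot_classify(
    document: str,
    labels: Sequence[str],
    entail_score: Callable[[str, str], float],
    template: str = DEFAULT_TEMPLATE,
) -> Tuple[str, Dict[str, float]]:
    """Score every (document, hypothesis) pair and return the best label."""
    scores = {
        label: entail_score(document, build_hypothesis(label, template))
        for label in labels
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer (assumption): fraction of hypothesis words found in
    the premise. A trained entailment or relevance model would replace this."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_words:
        return 0.0
    hits = sum(1 for word in hypothesis_words if word in premise_words)
    return hits / len(hypothesis_words)
```

Calling `zero_shot_classify(document, labels, toy_entail_score)` returns the chosen label together with the per-label scores. Keeping the scorer as a parameter means the same loop works whether the model behind it was trained on NLI data or on dictionary-derived data.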

Paper Structure (21 sections, 1 equation, 2 figures, 4 tables)

This paper contains 21 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Distribution of text sample length, expressed in terms of word count, for the training, validation and test sets of LETZ-SYN.
  • Figure 2: Illustration of the entailment approach (Yin et al., 2019) for ZSC.