Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection

Barah Fazili, Ashish Sunil Agrawal, Preethi Jyothi

TL;DR

This work tackles zero-shot cross-lingual transfer to low-resource target languages by generating task-specific data in English with an open LLM and optionally translating it into the target languages. The synthetic data is labeled either with soft pseudolabels from a trained teacher model (teacher-student) or with hard labels taken from the generation prompt (prompt-driven), and data-selection strategies then curate a compact, diverse augmentation set. The study evaluates several selection techniques (rand-k, top-k, div-k, amb-k, easy-k) under both training paradigms on sentiment analysis and natural language inference in Hindi, Marathi, Urdu, and Swahili, reporting gains of up to 7.13 absolute points (1.5 points on average) over baselines. Key findings: soft labels outperform hard labels, data volume and class balance matter, and greater diversity in the augmented data correlates with better generalization; cross-domain prompts and target-train data can further boost results. Overall, the approach offers a data-efficient way to improve multilingual transfer without target-language labels.

Abstract

Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote cross-lingual transfer for low-resource target languages. Given task-specific data in a source language and a teacher model trained on this data, we propose using this teacher to label LLM generations and employ a set of simple data selection strategies that use the teacher's label probabilities. Our data selection strategies help us identify a representative subset of diverse generations that help boost zero-shot accuracies while being efficient, in comparison to using all the LLM generations (without any subset selection). We also highlight other important design choices that affect cross-lingual performance such as the use of translations of source data and what labels are best to use for the LLM generations. We observe significant performance gains across sentiment analysis and natural language inference tasks (of up to a maximum of 7.13 absolute points and 1.5 absolute points on average) across a number of target languages (Hindi, Marathi, Urdu, Swahili) and domains.
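
To make the idea of selecting generations from the teacher's label probabilities concrete, here is a minimal Python sketch. It is not the authors' implementation: the strategy names rand-k, top-k, and amb-k come from the summary above, but the scoring rules used here (maximum class probability for top-k, predictive entropy for amb-k, uniform sampling for rand-k) are assumptions made for illustration, and the function names and the toy `teacher_probs` values are hypothetical.

```python
import math
import random


def entropy(p):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def select_top_k(probs, k):
    """Assumed top-k rule: keep the k generations the teacher is most confident about."""
    order = sorted(range(len(probs)), key=lambda i: max(probs[i]), reverse=True)
    return order[:k]


def select_ambiguous_k(probs, k):
    """Assumed amb-k rule: keep the k generations with the highest predictive entropy."""
    order = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    return order[:k]


def select_random_k(probs, k, seed=0):
    """Assumed rand-k rule: keep k generations sampled uniformly without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(probs)), min(k, len(probs))))


# Hypothetical teacher outputs for four LLM generations on a two-class task.
teacher_probs = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # near the decision boundary
    [0.10, 0.90],  # confident prediction
    [0.50, 0.50],  # maximally ambiguous
]

print(select_top_k(teacher_probs, 2))        # [0, 2]
print(select_ambiguous_k(teacher_probs, 2))  # [3, 1]
print(select_random_k(teacher_probs, 2))     # two distinct indices, fixed by the seed
```

In a teacher-student setup, the selected indices would pick out the generations whose teacher probabilities serve as soft labels for training the student.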

Paper Structure

This paper contains 38 sections, 1 equation, 2 figures, and 29 tables.

Figures (2)

  • Figure 1: Overall schematic illustrating various aspects of LLM-based augmentation.
  • Figure 2: Diversity scores of augmented data for different data selection strategies and different tasks.