ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models
David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, En-Shiun Annie Lee
TL;DR
Fine-tuning and evaluating LMs across multilingual tasks is prohibitively expensive. ProxyLM is a scalable framework that uses proxy models as surrogates and a regressor $g$ to predict the target LM's performance, formalized as $\hat{y}_{\mathcal{M}} = g(\hat{y}_{\mathcal{M}_p}; \Phi(\mathcal{D},\mathcal{D}'); \Psi(\mathcal{L}_s, \mathcal{L}_t))$, where $\hat{y}_{\mathcal{M}_p}$ is the proxy models' performance, $\Phi$ gives dataset features, and $\Psi$ gives language features for the source and target languages. This formulation makes the predictions task- and language-agnostic. The approach leverages NLPerf features (language, dataset, and proxy-model signals) plus distribution-shift cues. With small proxies it achieves up to a $37.08\times$ speedup, and it improves RMSE by at least $1.78\times$ over baselines across MT and MASSIVE tasks, including on unseen languages. Ensembles of proxy models generally yield the best predictions, remain robust in the Unseen and Cross-Dataset settings, and generalize well across languages and domains. This work offers a cost-effective, flexible tool for model selection and evaluation in multilingual NLP and opens avenues for extending proxy-based prediction to additional tasks and language families.
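The formulation above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names (`build_features`, `fit_least_squares`), the use of a plain linear least-squares regressor as $g$, and the synthetic toy data. It is not the authors' regressor or feature set, and it shows only how proxy scores, dataset features, and language features could be concatenated into one input for a regressor that is then scored by RMSE.

```python
import math
import random

def build_features(proxy_scores, dataset_feats, lang_feats):
    """Concatenate the three input groups of g into one vector, with a leading bias term."""
    return [1.0] + list(proxy_scores) + list(dataset_feats) + list(lang_feats)

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_least_squares(X, y):
    """Fit weights w minimizing ||Xw - y||^2 via the normal equations."""
    d = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve_linear_system(XtX, Xty)

def predict(w, x):
    """Evaluate the (hypothetical) linear regressor g on one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Synthetic toy data (not from the paper): the target score is an exact
# linear function of one proxy score, one dataset feature, one language feature.
random.seed(0)
rows, targets = [], []
for _ in range(40):
    proxy = random.uniform(10.0, 40.0)  # hypothetical proxy-model score
    shift = random.uniform(0.0, 1.0)    # hypothetical distribution-shift feature
    sim = random.uniform(0.0, 1.0)      # hypothetical language-similarity feature
    rows.append(build_features([proxy], [shift], [sim]))
    targets.append(5.0 + 1.2 * proxy - 3.0 * shift + 2.0 * sim)

X_train, y_train = rows[:30], targets[:30]
X_test, y_test = rows[30:], targets[30:]

w = fit_least_squares(X_train, y_train)
test_rmse = rmse(y_test, [predict(w, x) for x in X_test])
print("weights:", w)
print("held-out RMSE:", test_rmse)
```

Because the toy target is exactly linear in the features, the held-out RMSE here is near zero; real performance data would be noisier and would likely call for a more expressive regressor.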
Abstract
Performance prediction estimates the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating the computational costs associated with model capacity and fine-tuning data. Our paper presents ProxyLM, a scalable task- and language-agnostic framework designed to predict the performance of LMs using proxy models. These proxy models act as surrogates, approximating the performance of the LM of interest. By leveraging these proxy models, ProxyLM significantly reduces computational overhead in task evaluations, achieving up to a $37.08\times$ speedup over traditional methods, even with our smallest proxy models. Our results across multiple multilingual NLP tasks and various robustness tests demonstrate that ProxyLM not only adapts well to languages previously unseen by the pre-trained LMs, but also generalizes effectively across different datasets, outperforming the state-of-the-art by at least $1.78\times$ in terms of root-mean-square error (RMSE).
