ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models

David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, En-Shiun Annie Lee

TL;DR

LM fine-tuning and evaluation across multilingual tasks are prohibitively expensive. ProxyLM introduces a scalable framework that uses proxy models as surrogates and a regressor to predict the target LM's performance, formalized as $\hat{y}_{\mathcal{M}} = g(\hat{y}_{\mathcal{M}_p}; \Phi(\mathcal{D},\mathcal{D}'); \Psi(\mathcal{L}_s, \mathcal{L}_t))$, enabling task- and language-agnostic predictions. The approach leverages NLPerf features (language, dataset, and proxy-model signals) plus distribution-shift cues, and demonstrates up to $37.08\times$ time savings with small proxies while achieving RMSE improvements of at least $1.78\times$ over baselines across MT and MASSIVE tasks, including unseen languages. Ensemble proxy models generally yield the best predictions, with robust performance in Unseen and Cross-Dataset settings and favorable generalization across languages and domains. This work offers a cost-effective, flexible tool for model selection and evaluation in multilingual NLP and opens avenues for extending proxy-based prediction to additional tasks and language families.

Abstract

Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating computational costs associated with model capacity and data for fine-tuning. Our paper presents ProxyLM, a scalable task- and language-agnostic framework designed to predict the performance of LMs using proxy models. These proxy models act as surrogates, approximating the performance of the LM of interest. By leveraging these proxy models, ProxyLM significantly reduces computational overhead in task evaluations, achieving up to a 37.08x speedup over traditional methods, even with our smallest proxy models. Our results across multiple multilingual NLP tasks and various robustness tests demonstrate that ProxyLM not only adapts well to previously unseen languages in pre-trained LMs, but also generalizes effectively across different datasets, outperforming the state-of-the-art by at least 1.78x in terms of root-mean-square error (RMSE).
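To make the formulation $\hat{y}_{\mathcal{M}} = g(\hat{y}_{\mathcal{M}_p}; \Phi(\mathcal{D},\mathcal{D}'); \Psi(\mathcal{L}_s, \mathcal{L}_t))$ concrete, the following is a minimal sketch of the idea, not the authors' implementation. It assumes a plain linear model fitted by gradient descent as a stand-in for $g$ (the figure captions indicate the paper uses gradient-boosted regressors such as XGBoost and LGBM). The feature columns, the toy values, and the helper names `rmse`, `predict`, and `fit_linear_g` are all hypothetical and chosen only for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def predict(weights, bias, rows):
    """Apply the linear stand-in for g to each feature row."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in rows]

def fit_linear_g(rows, targets, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        # Residuals are computed once per epoch so every parameter
        # is updated from the same set of predictions.
        errors = [p - y for p, y in zip(predict(weights, bias, rows), targets)]
        for j in range(d):
            grad_j = 2.0 * sum(e * row[j] for e, row in zip(errors, rows)) / n
            weights[j] -= lr * grad_j
        bias -= lr * 2.0 * sum(errors) / n
    return weights, bias

# Hypothetical toy data, all scaled to [0, 1]. Each row holds:
#   [proxy model score, dataset-shift feature (Phi), language feature (Psi)]
rows = [
    [0.30, 0.10, 0.20],
    [0.45, 0.25, 0.40],
    [0.12, 0.60, 0.80],
    [0.50, 0.05, 0.10],
    [0.22, 0.40, 0.55],
    [0.35, 0.30, 0.35],
]
# Hypothetical scores of the target LM on the same six experiments.
targets = [0.34, 0.47, 0.15, 0.55, 0.25, 0.38]

weights, bias = fit_linear_g(rows, targets)
train_rmse = rmse(targets, predict(weights, bias, rows))
print(f"training RMSE of the linear stand-in: {train_rmse:.4f}")
```

In this sketch the expensive step (fine-tuning and evaluating the target LM) is needed only for the rows used to fit the regressor; for a new language pair or dataset, only the cheaper proxy score and the features would have to be computed before calling `predict`.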

Paper Structure (44 sections, 3 equations, 13 figures, 28 tables)

Figures (13)

  • Figure 1: ProxyLM framework for LM performance prediction. (Top) The evaluation metric is computed on the test set using a proxy model $\mathcal{M}_p^i$. (Bottom) The regressor $g$ is trained on proxy model scores together with dataset and language features by minimizing the RMSE between $y_\mathcal{M}$ and $\hat{y}_\mathcal{M}$.
  • Figure 2: Unseen and Cross-Dataset MT test results on the English-centric dataset in average RMSE (lower is better). We only show the best-performing baseline for comparison with ProxyLM with different proxy models. "No FT" denotes "no fine-tuning". We only show M2M100 results for the Unseen setting since NLLB covers all languages in the English-centric dataset. The reported results for the Unseen setting use XGBoost, while the Cross-Dataset experiments use LGBM. Ensemble denotes combining all four proxy models. The detailed breakdown of this result, with standard deviations, is given in Appendix Section \ref{sec:ensemble-breakdown}.
  • Figure 3: Ablation study on the LOLO setting with XGBoost on the English-centric and Many-to-Many Languages datasets. Here, "Proxy Models" refers to the Ensemble, i.e., the combination of all proxy models. Proxy Models significantly reduce RMSE across all scenarios.
  • Figure 4: Detailed results of XGBoost with ProxyLM Ensemble on the M2M100 model under the LOLO setting on MT using the English-centric dataset from Table \ref{tab:results-mt}. The results are grouped by (a) Joshi Class and (b) language family, following the mapping provided in Appendix \ref{sec:more-info-langs}; (c) shows a scatter plot of the correlation in spBLEU scores between ProxyLM's predictions and the estimated LM, with the light gray dashed line representing the line of equality ($y = x$) with $R^2 = 0.90$ and the black dashed line representing a Locally Weighted Scatterplot Smoothing (LOWESS) curve that shows the trend.
  • Figure 5: Detailed per-language results of XGBoost with ProxyLM Ensemble on the M2M100 model under the LOLO setting using the English-centric dataset on the MT task from Table \ref{tab:app-results-english-centric-detail}.
  • ...and 8 more figures