Improving the OOD Performance of Closed-Source LLMs on NLI Through Strategic Data Selection

Joe Stacey, Lisa Alazraki, Aran Ubhi, Beyza Ermis, Aaron Mueller, Marek Rei

TL;DR

The paper tackles the problem that fine-tuning closed-source LLMs for natural language inference (NLI) yields strong in-distribution (ID) gains but harms out-of-distribution (OOD) robustness. It introduces data-centric strategies that work under a fixed budget: uncertainty, difficulty, misclassification, and concatenative sampling, plus LLM-based synthetic data generation with varying complexity. Empirical results show that autoregressive LLMs outperform encoder baselines on OOD data, and that targeted data selection and synthetic data generation substantially improve robustness across diverse OOD datasets, with complexity-aware prompting yielding further gains. The findings advocate for autoregressive LLMs as robust baselines and demonstrate practical, cost-conscious methods to improve OOD performance for closed-source models in real-world settings.

Abstract

We investigate the robustness of fine-tuned Large Language Models (LLMs) for the task of Natural Language Inference (NLI), finding that the in-distribution gains from fine-tuning correspond to a large drop in out-of-distribution (OOD) performance. Despite the widespread use of closed-source LLMs, there are no robustness mitigation methods that work under their API fine-tuning constraints. Existing methods to improve robustness typically require changing the fine-tuning process or large-scale data augmentation, methods that are infeasible or cost prohibitive for closed-source models. To address this, we propose strategically selecting the NLI fine-tuning data, prioritising more complex examples or replacing existing training examples with LLM-generated data. Prioritising more complex training examples improves performance on challenging OOD NLI datasets, while training with synthetic data leads to substantial improvements on easier OOD datasets. We find that synthetic examples are often too simple, and by prompting LLMs to create more complex synthetic data we can improve performance on both easy and challenging OOD datasets. Finally, we show that recent autoregressive LLMs are substantially more robust to distributional shifts compared to encoder models, and should be a preferred baseline for future research.
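
As a rough illustration of the uncertainty-based selection named in the TL;DR, the sketch below ranks candidate training examples by the entropy of a model's predicted label distribution and keeps the most uncertain ones up to a fixed budget. This is not the authors' implementation: the function names, the use of entropy as the uncertainty score, and the toy probabilities are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a label distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_by_uncertainty(examples, label_probs, budget):
    """Return the `budget` examples whose predicted label distribution is most uncertain.

    `examples` and `label_probs` are parallel lists; each entry of `label_probs`
    holds hypothetical probabilities for (entailment, neutral, contradiction).
    """
    scored = sorted(
        zip(examples, label_probs),
        key=lambda pair: entropy(pair[1]),
        reverse=True,  # most uncertain first
    )
    return [example for example, _ in scored[:budget]]


# Toy candidate pool with made-up probabilities (illustrative only).
pool = ["pair_a", "pair_b", "pair_c"]
probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform, highly uncertain
    [0.60, 0.30, 0.10],  # moderately uncertain
]
selected = select_by_uncertainty(pool, probs, budget=2)
```

With a closed-source model, the label probabilities would have to come from whatever the API exposes (for example, token log-probabilities, if available); that step is outside the scope of this sketch.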

Paper Structure

This paper contains 56 sections, 2 figures, 20 tables.

Figures (2)

  • Figure 1: Examples of a training instance in $\mathcal{D}_\text{up}$ from our different methods.
  • Figure 2: Examples from the different NLI test sets used for model evaluation. The examples from Challenge-OOD datasets are more difficult than those from SNLI or the Standard-OOD datasets.