Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models

Lovish Madaan, David Esiobu, Pontus Stenetorp, Barbara Plank, Dieuwke Hupkes

TL;DR

Model output distributions become more similar to human label distributions as scale increases, but the divergence between them remains much higher than the divergence between two populations of humans, making it a potentially interesting statistic to consider.

Abstract

In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model's ability to perform natural language inference (NLI) tasks. In this paper, we investigate whether NLI tasks, which are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate whether they are able to discriminate between models of different size and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while model distributions become more similar to human label distributions with scale, the divergence between them is still much higher than the divergence between two populations of humans, making it a potentially interesting statistic to consider.
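To make the distribution comparison concrete, the sketch below computes the two statistics the figure captions refer to: the entropy of a human label distribution and the Jensen-Shannon divergence (JSD) between a human label distribution and a model's softmax distribution. It is a minimal illustration, not the authors' implementation. The function names, the base-2 logarithm, the label order, and all example numbers are assumptions made here; this page also does not say whether the paper reports the divergence or its square root (the Jensen-Shannon distance).

```python
import math


def normalize(counts):
    """Turn raw annotator vote counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution; zero entries contribute 0."""
    return -sum(x * math.log(x, base) for x in p if x > 0)


def kl_divergence(p, q, base=2.0):
    """KL(p || q); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)


# Hypothetical example (numbers invented for illustration only).
# Assumed label order: entailment, neutral, contradiction.
human = normalize([60, 30, 10])  # votes from 100 hypothetical annotators
model = [0.90, 0.08, 0.02]       # hypothetical model softmax over the labels

print("human label entropy:", entropy(human))
print("JSD(human, model):  ", jsd(human, model))
```

With base-2 logarithms the JSD lies between 0 (identical distributions) and 1 (disjoint support), so a lower value means the model's distribution is closer to the human one.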

Paper Structure

This paper contains 38 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Performance across shots. We show the accuracies for six fully pre-trained models on the five NLI benchmarks. Dashed lines indicate random and finetuned-BERT baselines.
  • Figure 2: Performance during pre-training. We show how accuracy for the five benchmarks develops during pre-training for two Llama-3 style models.
  • Figure 3: Contamination results. We show the EPG vs the percent of the evaluation dataset marked as contaminated according to different thresholds.
  • Figure 4: Accuracy vs entropy and final model JSDs. a) Accuracy vs entropy for Llama 8B and Llama 405B. We show how the accuracy of Llama 8B and Llama 405B changes as the entropy of the human label distributions increases. Accuracy-vs-entropy plots for all other models can be found in \ref{fig:entropy_accuracy_all}. b) Final model JSDs for each of the benchmarks in ChaosNLI. JSDs are substantially lower than chance and BERT JSDs, but substantially higher than JSDs between humans.
  • Figure 5: Development of JSD during training. We show how the JSD of our trained-from-scratch 8B and 70B models develops during training.
  • ...and 1 more figure