Learning to vary: Teaching LMs to reproduce human linguistic variability in next-word prediction

Tobias Groot, Salo Lacunes, Evgenia Ilia

TL;DR

This paper addresses the misalignment between LM-reproduced variability and human linguistic variability in next-word prediction. By fine-tuning with multiple plausible word continuations, using both pre-trained LMs and instruction-tuned LMs on the Provo Corpus, it demonstrates improved reproduction of human variability as measured by total variation distance between human and model distributions. The results show that preserving and training with multiple labels can yield substantial alignment gains across contexts with varying open-endedness, though there are trade-offs for tasks lacking inherent variability and for certain model scales. Overall, the approach offers a practical path toward embracing human-like variability in generative language models, with implications for robustness and fairness in open-ended NLG tasks.

Abstract

Natural language generation (NLG) tasks are often subject to inherent variability; e.g., predicting the next word given a context has multiple valid responses, as is evident when multiple humans are asked to complete the task. While language models (LMs) that are aligned pluralistically, so that they can reproduce the inherent diversity in perspectives of an entire population of interest, are clearly beneficial, Ilia and Aziz (2024) show that LMs do not reproduce this type of linguistic variability well. They speculate that this inability might stem from the lack of consistent training of LMs on data reflecting this type of inherent variability. We therefore investigate whether training LMs on multiple plausible word continuations per context can improve their ability to reproduce human linguistic variability in next-word prediction. We employ fine-tuning techniques for pre-trained and instruction-tuned models and demonstrate their potential by fine-tuning GPT-2 and Mistral-7B-IT on the Provo Corpus. Our evaluation, which measures the divergence between empirically estimated human and model next-word distributions across contexts before and after fine-tuning, shows that our multi-label fine-tuning improves the LMs' ability to reproduce linguistic variability, both for contexts that admit higher variability and for those that admit lower variability.
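
To make the evaluation metric concrete, the following is a minimal sketch of how total variation distance (TVD) between two empirically estimated next-word distributions could be computed for a single context. The helper names (`empirical_distribution`, `total_variation_distance`) and the toy word samples are illustrative assumptions, not the authors' code or data; the paper's exact estimation procedure (e.g. number of samples, tokenization) is not specified here.

```python
from collections import Counter


def empirical_distribution(words):
    """Estimate a next-word distribution from a list of sampled words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def total_variation_distance(p, q):
    """TVD between two discrete distributions given as word -> probability dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)


# Toy (hypothetical) samples for one context: four human responses
# and four words sampled from a model.
human_words = ["dog", "dog", "cat", "bird"]
model_words = ["dog", "cat", "cat", "fish"]

tvd = total_variation_distance(
    empirical_distribution(human_words),
    empirical_distribution(model_words),
)
print(tvd)
```

TVD ranges from 0 (identical distributions) to 1 (disjoint supports), so lower values indicate closer agreement between the model and the human population for that context.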

Paper Structure

This paper contains 27 sections, 3 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Distribution of TVD scores (for 1 seed) across contexts. For both GPT-2 and Mistral-7B-IT, fine-tuning shifts the TVD distribution towards the Oracle baseline, suggesting better linguistic alignment with humans.
  • Figure 2: Distribution of TVD scores across contexts, for the two remaining seeds not presented in the main paper. For both GPT-2 and Mistral-7B-IT, fine-tuning shifts the TVD distribution towards the Oracle baseline, suggesting improved alignment with human linguistic variability.
  • Figure 3: Distribution of differences of TVD scores between the model and the human CPDs and the oracle CPDs, for all 3 seeds. For both GPT-2 and Mistral-7B-IT, fine-tuning shifts the TVD distribution towards smaller differences, confirming previous findings.
  • Figure 4: Distribution of differences in TVD scores (TVD between the fine-tuned model and the human CPDs minus TVD between the non-fine-tuned model and the human CPDs), plotted against TVD among oracles. Performance gains (negative differences) for both models occur across contexts of varying open-endedness (with lower TVD indicating more 'restricted' contexts).
  • Figure 5: Distribution of differences in TVD scores (TVD between the fine-tuned model and the human CPDs minus TVD between the non-fine-tuned model and the human CPDs), plotted against TVD among oracles. Here, only datapoints with observed improvements (i.e. negative differences) for both models are plotted. As in Figure 4, gains occur across contexts of varying open-endedness (with lower TVD indicating more 'restricted' contexts).
  • ...and 7 more figures
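
The quantity plotted in Figures 4 and 5 can be sketched as a per-context difference of two TVD scores. The snippet below is an illustrative assumption of how such a difference might be computed; the function names (`tvd`, `tvd_gain`) and the toy distributions are hypothetical and do not come from the paper.

```python
def tvd(p, q):
    """Total variation distance between two word -> probability dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)


def tvd_gain(human, base_model, tuned_model):
    """TVD(tuned, human) minus TVD(base, human); negative means improvement."""
    return tvd(tuned_model, human) - tvd(base_model, human)


# Toy (hypothetical) distributions for a single context.
human = {"dog": 0.5, "cat": 0.25, "bird": 0.25}
base_model = {"dog": 1.0}               # puts all mass on one word
tuned_model = {"dog": 0.5, "cat": 0.5}  # spreads mass over two words

gain = tvd_gain(human, base_model, tuned_model)
print(gain)
```

A negative value means the fine-tuned model is closer to the human distribution than the original model for that context, which is the sense of "performance gains" used in the captions above.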