Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs

David R. Mortensen, Valentina Izrailevitch, Yunze Xiao, Hinrich Schütze, Leonie Weissweiler

TL;DR

The paper investigates lexical-syntactic flexibility in English—specifically conversion or zero-derivation—by framing a natural language inference task that tests how models generalize words into non-prototypical parts of speech. It introduces a dataset of 3,069 prompts built from frames and word lists (transitive verbs, count/mass nouns, and nonce words) drawn from UniMorph, and evaluates five LLMs (GPT-3.5, GPT-4, Mistral 7B, Falcon 40B, Llama 2 70B). The results show GPT-4 achieving the strongest performance, with GPT-3.5 and open-source models capable but generally trailing, and crucially reveal that model size does not monotonically predict lexical-syntactic flexibility. The study provides a methodology and dataset for evaluating conversion in LLMs and suggests future work exploring more frames and open models to disentangle true generalization from training data exposure.

Abstract

Lexical-syntactic flexibility, in the form of conversion (or zero-derivation), is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility -- the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models: two proprietary models (GPT-3.5 and GPT-4) and three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open-source language models are also able to perform it, and that the 7B-parameter Mistral displays as little difference between its baseline performance on the natural language inference task and its performance on the non-prototypical syntactic category task as the massive GPT-4 does.
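
The comparison described in the abstract, between a model's baseline accuracy and its accuracy on the non-prototypical task, can be sketched as a per-condition accuracy table followed by a subtraction. The snippet below is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the record layout, and the toy records are hypothetical, and only the condition labels "p" (prototypical) and "np" (non-prototypical) follow the paper's Figure 2 caption.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {(model, condition): accuracy}.

    `records` is an iterable of (model, condition, is_correct) tuples.
    This layout is an assumption made for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # (model, condition) -> [correct, seen]
    for model, condition, is_correct in records:
        cell = totals[(model, condition)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}


def flexibility_gap(acc, model, baseline="p", test="np"):
    """Baseline accuracy minus non-prototypical accuracy for one model.

    A smaller gap means the model loses less when a word appears in a
    non-prototypical part of speech.
    """
    return acc[(model, baseline)] - acc[(model, test)]


# Toy records, invented purely to exercise the functions (not paper results).
records = [
    ("model_a", "p", True), ("model_a", "p", True),
    ("model_a", "np", True), ("model_a", "np", False),
    ("model_b", "p", True), ("model_b", "p", False),
    ("model_b", "np", True), ("model_b", "np", False),
]

acc = accuracy_by_condition(records)
gap_a = flexibility_gap(acc, "model_a")  # higher baseline, larger gap
gap_b = flexibility_gap(acc, "model_b")  # lower baseline, no gap
```

In this toy data, `model_a` has the higher absolute accuracy while `model_b` has the smaller gap, which is the distinction the abstract draws between best overall performance and the smallest drop from baseline.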

Paper Structure

This paper contains 10 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Calvin and Hobbes © 1993 Watterson. Reprinted with permission of Andrews McMeel Syndication. All rights reserved.
  • Figure 2: Average accuracy grouped by model and typicality (p- for "prototypical," np- for "non-prototypical," and no- for "nonce")