Derivational Morphology Reveals Analogical Generalization in Large Language Models

Valentin Hofmann, Leonie Weissweiler, David Mortensen, Hinrich Schütze, Janet Pierrehumbert

Abstract

What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. It is not yet known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.

Paper Structure

This paper contains 23 sections, 3 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Distribution of preferred nominalization type (specifically, ratio of -ness derivatives) for unseen nonce adjectives, for rule-based models (a, b), exemplar-based models (c, d), and GPT-J (e). Models based on types are shown on the left (a, c), and models based on tokens are shown on the right (b, d). The ratio is computed as the number of -ness predictions divided by the total number of predictions (i.e., -ness and -ity predictions).
  • Figure 2: Ratio of bases preferring -ness in the Pile (a) and GPT-J's predictions with one example prompt (b). Results are similar for the other prompts. The suffixes of the base (i.e., adjective classes) are grouped by degree of competition between -ity and -ness.
  • Figure 3: Impact of word frequency on GPT-J's confidence in its choice. x-axis: Log probability difference between the attested and unattested choices for low-frequency derivatives with $f \in (0, 10]$. We have converted the log probabilities from base $e$ to base $10$ for better readability. y-axis: Relative increase in confidence for high-frequency derivatives with $f \in (100, \infty)$. Each dot corresponds to GPT-J's predictions for an adjective class given a specific prompt. Dots are colored by degree of competition between -ity and -ness. We added LOWESS lines for r-ness and r-ity. Dots at $y= 0\%$ indicate the expected behavior if r-ness and r-ity were handled by rule.
  • Figure 4: Impact of morphological decomposability of words on their familiarity as rated by human annotators (a) and the log probability assigned to them by GPT-J (b).
  • Figure 5: Distribution of preferred nominalization type (specifically, ratio of -ness derivatives) for unseen nonce adjectives, for GPT-J (a) and human annotators (b). The ratio is computed as the number of -ness predictions divided by the total number of predictions. Panel (a) replicates Figure 1 from the main text for easier comparison.
  • ...and 2 more figures
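
The captions of Figures 1 and 3 define two simple quantities: the ratio of -ness predictions among all -ness and -ity predictions, and a log probability difference converted from base $e$ to base $10$. The following is a minimal Python sketch of both, not the authors' code. The function names and the string labels "ness" and "ity" are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def ness_ratio(predictions):
    """Number of -ness predictions divided by all -ness and -ity predictions.

    `predictions` is assumed to be an iterable of the strings "ness" and "ity"
    (a hypothetical encoding of the preferred suffix for each nonce adjective).
    """
    counts = Counter(predictions)
    total = counts["ness"] + counts["ity"]
    if total == 0:
        raise ValueError("no -ness or -ity predictions to compute a ratio from")
    return counts["ness"] / total


def ln_to_log10(logprob_e):
    """Convert a natural-log probability (base e) to base 10."""
    return logprob_e / math.log(10)


def log10_prob_difference(lp_attested, lp_unattested):
    """Base-10 log probability difference between the attested and the
    unattested derivative, given natural-log probabilities as input."""
    return ln_to_log10(lp_attested - lp_unattested)
```

For example, `ness_ratio(["ness", "ity", "ness", "ness"])` gives 0.75, and a base-10 difference of 2 means the attested form is 100 times more probable than the unattested one.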