Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models

Isabel Papadimitriou, Dan Jurafsky

TL;DR

This paper investigates how LSTM language models acquire abstract syntactic representations by pretraining on non-linguistic structured data and transferring to natural language. The proposed Test for Inductive Bias via Language model Transfer (TILT) isolates structural knowledge by freezing the LSTM weights and fine-tuning only the embedding layers on the new language, which removes lexical confounds. The results show that pretraining on music or code substantially improves performance on natural language, and that simple artificial grammars with either recursive or flat paired-token structure help as well, indicating that abstract structure rather than surface form drives the transfer. In cross-linguistic experiments, transfer strength tracks typological syntactic similarity, suggesting that LSTMs encode structural properties shared across languages, beyond surface vocabulary.

Abstract

We propose transfer learning as a method for analyzing the encoding of grammatical structure in neural language models. We train LSTMs on non-linguistic data and evaluate their performance on natural language to assess which kinds of data induce generalizable structural features that LSTMs can use for natural language. We find that training on non-linguistic data with latent structure (MIDI music or Java code) improves test performance on natural language, despite no overlap in surface form or vocabulary. To pinpoint the kinds of abstract structure that models may be encoding to lead to this improvement, we run similar experiments with two artificial parentheses languages: one which has a hierarchical recursive structure, and a control which has paired tokens but no recursion. Surprisingly, training a model on either of these artificial languages leads to the same substantial gains when testing on natural language. Further experiments on transfer between natural languages controlling for vocabulary overlap show that zero-shot performance on a test language is highly correlated with typological syntactic similarity to the training language, suggesting that representations induced by pre-training correspond to the cross-linguistic syntactic properties. Our results provide insights into the ways that neural models represent abstract syntactic structure, and also about the kind of structural inductive biases which allow for natural language acquisition.
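
To make the contrast between the two artificial parentheses languages in the abstract concrete, here is a minimal sketch of how such corpora could be generated. It is not the authors' generator: the function names, the use of integers as tokens, the explicit open/close labels, the vocabulary size, and the opening probability are all assumptions chosen for illustration. The only property carried over from the abstract is that one language closes pairs in a nested (last-opened, first-closed) order, while the control closes pairs in arbitrary order.

```python
import random


def nested_corpus(n_tokens, vocab_size=50, p_open=0.4, seed=0):
    """Hierarchical language: a close always matches the most recent open."""
    rng = random.Random(seed)
    stack, out = [], []
    while len(out) < n_tokens:
        if not stack or rng.random() < p_open:
            tok = rng.randrange(vocab_size)
            stack.append(tok)
            out.append((tok, "open"))
        else:
            # Stack discipline: close the last-opened token.
            out.append((stack.pop(), "close"))
    return out


def flat_corpus(n_tokens, vocab_size=50, p_open=0.4, seed=0):
    """Control language: tokens are paired, but pairs may cross."""
    rng = random.Random(seed)
    open_toks, out = [], []
    while len(out) < n_tokens:
        if not open_toks or rng.random() < p_open:
            tok = rng.randrange(vocab_size)
            open_toks.append(tok)
            out.append((tok, "open"))
        else:
            # No stack discipline: close any currently open token.
            idx = rng.randrange(len(open_toks))
            out.append((open_toks.pop(idx), "close"))
    return out


def is_well_nested(seq):
    """True if every close matches the most recent unmatched open.

    Tokens still open at the end of the sequence are allowed, since both
    generators stop after a fixed number of tokens.
    """
    stack = []
    for tok, role in seq:
        if role == "open":
            stack.append(tok)
        elif not stack or stack.pop() != tok:
            return False
    return True
```

Under these assumptions, `is_well_nested(nested_corpus(1000))` holds by construction, while a flat corpus of the same length will almost always contain crossing pairs. Both corpora still share the property that every closing token has a matching earlier opening token.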

Paper Structure

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: We find that LSTM LMs can utilize various types of non-linguistic structure to help learn to model human language, and that nested hierarchical structure does not lead to more expressive encodings than flat, head-dependency pair structure. We also find that LSTM LMs learn representations that correlate with typological syntactic feature distance, allowing them to transfer more effectively from languages which are grammatically similar.
  • Figure 2: Diagram illustrating our training procedure: $k$ models are trained on $k$ L1 languages, and then their LSTM weights are frozen while their linear layers are finetuned on a common L2 language (in our case, we always use Spanish as the L2). We can then compare their performance on the common L2.
  • Figure 3: Examples illustrating the content of our non-linguistic corpora for Experiments 1-3. All examples are taken from the corpora.
  • Figure 4: Results of Experiments 1 through 3, training on non-linguistic corpora. Error bars on all bars indicate a 95% $t$-test confidence interval over 5 restarts with different random seeds. All structured data is much better to train on than random data, including music which has a totally divergent vocabulary surface form from the rest. The two parentheses corpora result in equivalent perplexities, even though one has a hierarchical underlying structure and the other does not.
  • Figure 5: Results of Experiment 4. Transfer is better between typologically similar languages, even when vocabularies are disjoint. Perplexity on Spanish test data plotted against the WALS-syntax distance of each model's L1 to Spanish. The relationship is almost linear for Indo-European languages, and then reaches a ceiling. Error bars show 95% CIs for $n=5$ trials with different random seeds. These results demonstrate how LSTMs can transfer knowledge more easily to languages that share structural features with the L1, and that this correlation is robust to multiple trials. The orange line represents the oracle perplexity of training all parameters to convergence on the L2 train data. Romance languages are in red, other Indo-European languages are in purple, and non-Indo-European languages are blue.
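
The captions for Figures 4 and 5 report 95% $t$-test confidence intervals over 5 restarts with different random seeds. As a rough sketch of how such an interval can be computed, the snippet below uses the standard two-sided $t$ interval for the mean. The function name and the example perplexities are hypothetical placeholders, not values from the paper, and the critical value is hardcoded for $n=5$ (4 degrees of freedom) to keep the example free of external dependencies.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 4 degrees of freedom,
# i.e. n = 5 runs. Other sample sizes would need other table values.
T_CRIT_95_DF4 = 2.776


def t_confidence_interval_5_runs(values):
    """Return (low, high) of the 95% t interval for the mean of 5 runs."""
    if len(values) != 5:
        raise ValueError("this sketch only covers n = 5 runs")
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = T_CRIT_95_DF4 * sem
    return mean - half_width, mean + half_width


# Made-up perplexities from five hypothetical seeds (illustration only).
low, high = t_confidence_interval_5_runs([10.0, 12.0, 11.0, 13.0, 14.0])
```

With these placeholder inputs the mean is 12 and the half-width is roughly 1.96, so the interval is about (10.04, 13.96). Error bars drawn this way reflect variation across random seeds only, not variation across corpora or hyperparameters.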