Table of Contents
Fetching ...

Analysis of Argument Structure Constructions in the Large Language Model BERT

Pegah Ramezani, Achim Schilling, Patrick Krauss

TL;DR

This research demonstrates the potential of both recurrent and transformer-based neural language models to mirror linguistic processing in the human brain, offering valuable insights into the computational and neural mechanisms underlying language understanding.

Abstract

This study investigates how BERT processes and represents Argument Structure Constructions (ASCs), extending previous LSTM analyses. Using a dataset of 2000 sentences across four ASC types (transitive, ditransitive, caused-motion, resultative), we analyzed BERT's token embeddings across 12 layers. Visualizations with MDS and t-SNE and clustering quantified by Generalized Discrimination Value (GDV) were used. Feedforward classifiers (probes) predicted construction categories from embeddings. CLS token embeddings clustered best in layers 2-4, decreased in intermediate layers, and slightly increased in final layers. DET and SUBJ embeddings showed consistent clustering in intermediate layers, VERB embeddings increased in clustering from layer 1 to 12, and OBJ embeddings peaked in layer 10. Probe accuracies indicated low construction information in layer 1, with over 90 percent accuracy from layer 2 onward, revealing latent construction information beyond GDV clustering. Fisher Discriminant Ratio (FDR) analysis of attention weights showed OBJ tokens were crucial for differentiating ASCs, followed by VERB and DET tokens. SUBJ, CLS, and SEP tokens had insignificant FDR scores. This study highlights BERT's layered processing of linguistic constructions and its differences from LSTMs. Future research will compare these findings with neuroimaging data to understand the neural correlates of ASC processing. This research underscores neural language models' potential to mirror linguistic processing in the human brain, offering insights into the computational and neural mechanisms underlying language understanding.

Analysis of Argument Structure Constructions in the Large Language Model BERT

TL;DR

This research demonstrates the potential of both recurrent and transformer-based neural language models to mirror linguistic processing in the human brain, offering valuable insights into the computational and neural mechanisms underlying language understanding.

Abstract

This study investigates how BERT processes and represents Argument Structure Constructions (ASCs), extending previous LSTM analyses. Using a dataset of 2000 sentences across four ASC types (transitive, ditransitive, caused-motion, resultative), we analyzed BERT's token embeddings across 12 layers. Visualizations with MDS and t-SNE and clustering quantified by Generalized Discrimination Value (GDV) were used. Feedforward classifiers (probes) predicted construction categories from embeddings. CLS token embeddings clustered best in layers 2-4, decreased in intermediate layers, and slightly increased in final layers. DET and SUBJ embeddings showed consistent clustering in intermediate layers, VERB embeddings increased in clustering from layer 1 to 12, and OBJ embeddings peaked in layer 10. Probe accuracies indicated low construction information in layer 1, with over 90 percent accuracy from layer 2 onward, revealing latent construction information beyond GDV clustering. Fisher Discriminant Ratio (FDR) analysis of attention weights showed OBJ tokens were crucial for differentiating ASCs, followed by VERB and DET tokens. SUBJ, CLS, and SEP tokens had insignificant FDR scores. This study highlights BERT's layered processing of linguistic constructions and its differences from LSTMs. Future research will compare these findings with neuroimaging data to understand the neural correlates of ASC processing. This research underscores neural language models' potential to mirror linguistic processing in the human brain, offering insights into the computational and neural mechanisms underlying language understanding.
Paper Structure (27 sections, 5 equations, 5 figures, 2 tables)

This paper contains 27 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: MDS projections of the CLS token embedding, i.e. hidden layer activation, from all hidden layers of the BERT model. Each point represents the activation of a sentence, color-coded according to its ASC type: caused-motion (blue), ditransitive (green), transitive (orange), and resultative (red).
  • Figure 2: t-SNE projections of the CLS token embedding from all hidden layers of the BERT model. Each point represents the activation of a sentence, color-coded according to its ASC type: caused-motion (blue), ditransitive (green), transitive (orange), and resultative (red).
  • Figure 3: GDV score of hidden layer activations. Note that, lower GDV values indicate better-defined clusters. The qualitative results from the MDS and t-SNE projections of the CLS token embeddings are underpinned by the GDV with best clustering occurring in layer 2. The GDV of specific token embeddings reveals distinct patterns of clustering across the BERT layers. DET and SUBJ token embeddings exhibited relatively constant clustering at an intermediate level across all layers, capturing construction-specific information consistently. VERB token embeddings showed a slight but systematic increase in clustering from layer 1 to layer 12, indicating improved differentiation according to construction types in deeper layers. OBJ token embeddings began with no clustering in layer 1 but significantly increased across layers, reaching a clustering level in layer 10 comparable to the CLS token in layer 2. These results highlight the varying contributions of different tokens to the representation of ASCs within BERT, with some tokens showing dynamic changes and others maintaining consistent clustering.
  • Figure 4: Accuracies of probing of hidden layers for classification of constructions per common tokens. Probe accuracies for CLS, DET, SUBJ, VERB, and OBJ tokens start at low or chance levels in layer 1, indicating that the initial embeddings contain no specific information about construction type, as also revealed by the GDV cluster analysis. However, from layer 2 to layer 12, all probe accuracies for different tokens consistently exceeded 90 percent, indicating latent information about construction categories in all token embeddings, even when not revealed through clustering alone.
  • Figure 5: Fisher Discriminant Ratio (FDR) scores for each token across all layers and attention heads. Each dot represents the FDR score of a specific attention head, while the dashed line indicates the mean FDR of all attention heads. Layers with similar FDR scores suggest consistent patterns of attention across those layers.