
Increasing The Performance of Cognitively Inspired Data-Efficient Language Models via Implicit Structure Building

Omar Momen, David Arps, Laura Kallmeyer

TL;DR

The study examines data-efficient language modeling by injecting implicit hierarchical structure via StructFormer-inspired architectures. Using seven model variants pretrained on a 10M-word BabyLM corpus, the authors evaluate on 39 tasks including BLiMP, SuperGLUE, and MSGS to assess whether hierarchical biases improve performance over a RoBERTa baseline. Results show task-dependent benefits but no consistent overall gains, with the best aggregate model (SR_s1') achieving selection on the shared task, and middle-layer parser placement underperforming in many settings. The work highlights the potential of hierarchical biases but also the need for careful task-wise analysis and more controlled experiments to isolate the conditions under which such biases are beneficial.

Abstract

In this paper, we describe our submission to the BabyLM Challenge 2023 shared task on data-efficient language model (LM) pretraining (Warstadt et al., 2023). We train transformer-based masked language models that incorporate unsupervised predictions about hierarchical sentence structure into the model architecture. Concretely, we use the StructFormer architecture (Shen et al., 2021) and variants thereof. StructFormer models have been shown to perform well on unsupervised syntactic induction based on limited pretraining data, and to yield performance improvements over a vanilla transformer architecture (Shen et al., 2021). Evaluation of our models on 39 tasks provided by the BabyLM challenge shows promising improvements on some particular tasks for models that integrate a hierarchical bias into the architecture, even though they fail to consistently outperform the RoBERTa baseline model provided by the shared task organizers across all tasks.

Paper Structure (26 sections, 2 figures, 8 tables)


Figures (2)

  • Figure 1: StructFormer and StructRoBERTa Architectures ($s_1$)
  • Figure 2: In-between Parser Architectures ($s_2$). Dotted lines mark the two positions at which the encoder layers are interrupted; the parser network connects the two split parts of the encoder