How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu

TL;DR

This work investigates how sequence modeling architectures shape the base capabilities of pre-trained language models. It introduces a limited-domain pre-training setting with out-of-distribution testing to reveal architecture-induced gaps that mixed-domain pre-training conceals, and shows that stateful architectures degrade base capabilities relative to the Transformer. Through ablation and factor analysis, it identifies full-sequence visibility, real relation calculation, and non-uniform attention as the determining factors and proposes the design principle of full-sequence arbitrary selection. The authors validate this principle with the minimalist Top-1 Element Selection architecture and the more practical Top-1 Chunk Selection architecture, both of which show strong base capabilities and favorable efficiency, offering a principled direction for future architecture design.

Abstract

Pre-trained language models represented by the Transformer have been shown to possess strong base capabilities, and the Transformer's self-attention mechanism has become a classic sequence modeling architecture. Unlike work that proposes new sequence modeling architectures to improve the efficiency of the attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, we ask: how exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? We first point out that the mixed-domain pre-training setting commonly adopted in existing architecture design work fails to adequately reveal the differences in base capabilities among architectures. To address this, we propose a limited-domain pre-training setting with out-of-distribution testing, which uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures and find that they exhibit significant degradation in base capabilities compared to the Transformer. Then, through a series of architecture component analyses, we summarize a key architecture design principle: a sequence modeling architecture needs to possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results support our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.
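
To make the idea of full-sequence arbitrary selection concrete, the following is a minimal sketch of a hard top-1 selection step. It is an illustration under stated assumptions, not the authors' implementation: the function name `top1_element_selection`, the dot-product scoring, and the causal restriction to positions `j <= t` are all assumptions, and the sketch covers only the forward selection, not how such a layer would be trained.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top1_element_selection(
    queries: Sequence[Vector],
    keys: Sequence[Vector],
    values: Sequence[Vector],
) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical hard top-1 selection over the visible prefix.

    For each position t, score every position j <= t (assumed causal
    visibility) by the query-key dot product and copy the value of the
    single best-scoring position. Ties go to the earliest position.
    """
    outputs: List[List[float]] = []
    selected: List[int] = []
    for t, q in enumerate(queries):
        # Every earlier position stays individually addressable.
        scores = [dot(q, keys[j]) for j in range(t + 1)]
        best = max(range(t + 1), key=lambda j: scores[j])
        selected.append(best)
        outputs.append(list(values[best]))
    return outputs, selected


# Toy example: position 2 reaches back and selects position 0.
queries = [[1, 0], [0, 1], [1, 0]]
keys = [[5, 0], [0, 2], [3, 0]]
values = [[10], [20], [30]]
outputs, selected = top1_element_selection(queries, keys, values)
print(selected)  # [0, 1, 0]
print(outputs)   # [[10], [20], [10]]
```

The point of the sketch is that the choice at each position depends on computed query-key relations and can land on any earlier element, in contrast to a stateful architecture that summarizes the prefix into a fixed-size state.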

Paper Structure

This paper contains 31 sections, 9 figures, 3 tables, 2 algorithms.

Figures (9)

  • Figure 1: Language modeling test results of various sequence modeling architectures under two pre-training settings. (Model parameters $\approx$ 110M, pre-trained tokens $=$ 100B, and sequence length $=$ 2k)
  • Figure 2: (a) and (b) Language modeling test results of various sequence modeling architectures under two pre-training settings. (c) Few-shot learning performance of these architectures. (Model parameters $\approx$ 1.3B, pre-trained tokens $=$ 100B, and sequence length $=$ 2k)
  • Figure 3: Analysis of the influence of various sequence modeling architecture components on base capabilities. (Model parameters $\approx$ 110M, pre-trained tokens $=$ 100B or 25B, and sequence length $=$ 2k)
  • Figure 4: The relationship between attention distribution entropy and temperature.
  • Figure 5: (a) The overall design of the Top-1 Element Selection architecture. (b) The kernel design for the key component of the Top-1 Chunk Selection architecture.
  • ...and 4 more figures