Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures

Gabriel Lindenmaier, Sean Papay, Sebastian Padó

TL;DR

Transformer language models are costly to train, which is a particular obstacle in low-resource settings. The authors propose hybrid architectures that combine recurrent blocks (QRNN/RNN) with selective attention, formalizing four building blocks and three architectures (Attn-QRNN, PAR Transformer, Hybrid Transformer). In experiments on Enwik8 and WikiText-103 with models of about 41M parameters, the Hybrid Transformer consistently matches or exceeds larger baselines, indicating better parameter efficiency. The work suggests a practical path to more scalable language modeling by balancing recurrence and attention, with potential impact on resource-constrained NLP deployments.

Abstract

Transformer-based language models have recently been at the forefront of active research in text generation. However, these models' advances come at the price of prohibitive training costs, with parameter counts in the billions and compute requirements measured in petaflop/s-decades. In this paper, we investigate transformer-based architectures for improving model performance in a low-data regime by selectively replacing attention layers with feed-forward and quasi-recurrent neural network layers. We test these architectures on the standard Enwik8 and WikiText-103 corpora. Our results show that our reduced architectures outperform existing models with a comparable number of parameters, and obtain comparable performance to larger models while significantly reducing the number of parameters.
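
To make the quasi-recurrent layers mentioned in the abstract more concrete, the following is a minimal sketch of the f-pooling recurrence from the standard QRNN formulation, h_t = f_t * h_(t-1) + (1 - f_t) * z_t. It is an illustration under stated assumptions, not the authors' implementation: the function name `qrnn_f_pooling`, the pure-Python list representation, and the zero initial state are all hypothetical choices, and the pre-activations are passed in directly, whereas a real QRNN layer would compute them with convolutions over the input sequence. The blocks used in the paper may differ in gating or pooling details.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def qrnn_f_pooling(z_pre, f_pre, h0=None):
    """Apply QRNN-style f-pooling over a sequence.

    z_pre, f_pre: lists of timesteps, each a list of floats holding the
        pre-activation candidate and forget-gate values (assumed inputs;
        in a full QRNN these come from convolutions over the input).
    h0: optional initial hidden state; defaults to zeros (an assumption).
    Returns the list of hidden states, one per timestep.
    """
    if len(z_pre) != len(f_pre):
        raise ValueError("z_pre and f_pre must have the same length")
    if not z_pre:
        return []

    width = len(z_pre[0])
    h_prev = list(h0) if h0 is not None else [0.0] * width
    outputs = []
    for z_row, f_row in zip(z_pre, f_pre):
        # Candidate and gate activations depend only on the current inputs,
        # so they could be computed for all timesteps in parallel.
        z = [math.tanh(v) for v in z_row]
        f = [sigmoid(v) for v in f_row]
        # The only sequential step: an elementwise blend with the old state.
        h_prev = [fi * hi + (1.0 - fi) * zi for fi, hi, zi in zip(f, h_prev, z)]
        outputs.append(h_prev)
    return outputs


# Example call on a two-step sequence with hidden width 2.
states = qrnn_f_pooling([[0.0, 1.0], [2.0, -1.0]], [[0.0, 0.0], [50.0, -50.0]])
```

In this sketch a gate value near 1 carries the previous state forward and a value near 0 overwrites it with the new candidate. Because the recurrence itself is elementwise and cheap, most of the work in such a layer can be done in parallel across timesteps, which is the usual motivation for using quasi-recurrent layers in place of some attention layers.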

Paper Structure

This paper contains 13 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Building blocks of our models.
  • Figure 2: Language model buildup & single PAR transformer and QRNN blocks.
  • Figure 3: The "Sequential Network" part of our three language modeling architecture layouts.