Towards smaller, faster decoder-only transformers: Architectural variants and their implications

Sathya Krishnan Suresh; Shunmugapriya P

TL;DR

This study introduces three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt), which demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes.

Abstract

Research on Large Language Models (LLMs) has grown rapidly in recent years, predominantly focusing on models built on the transformer architecture established by [1] and further developed through the decoder-only variants of [2]. Contemporary efforts in this field primarily aim to enhance model capabilities by scaling up both the architecture and the volume of training data. However, work on reducing model size while preserving efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt). These variants demonstrate performance comparable to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training. We open-source the model weights and the complete codebase of these implementations for further research.

Paper Structure (16 sections, 7 equations, 4 figures, 4 tables)

Introduction
Related Work
Architectural modifications
ParallelGPT
LinearGPT
ConvGPT
Experimental Setup and Results
Model Configuration
Training
Results
Comparison with the traditional architecture
Comparison between pgpt and pgpt-1
Conclusion
Limitations
Reduction Potential of the lgpt Architecture
...and 1 more sections

Figures (4)

Figure 1: Traditional GPT architecture
Figure 2: ParallelGPT
Figure 3: LinearGPT and ConvGPT
Figure 4: Loss comparison between the four models
