Scavenging Hyena: Distilling Transformers into Long Convolution Models

Tokiniaina Raharison Ralambomihanta; Shahrad Mohammadzadeh; Mohammad Sami Nur Islam; Wassim Jabbour; Laurence Liang

TL;DR

The paper addresses the high cost of LLM pre-training and the limited context length of attention-based models by distilling a transformer into a long-convolution Hyena model, in which attention is replaced by the subquadratic Hyena operator. It uses progressive knowledge transfer to pass layer-wise activations from an attention-based teacher to a Hyena-based student, and reports that this cross-architecture distillation reaches competitive perplexities and downstream task performance with substantially reduced training requirements. Experiments on a 70M-parameter GPT-NeoX-like setup show that the distilled Hyena students can match or exceed pre-training results in some cases while scaling better to long contexts. The work positions efficient long-context modeling combined with cross-architecture knowledge transfer as a practical path toward cheaper, more sustainable LLM training.
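As a rough illustration of the layer-wise transfer described above, the sketch below compares teacher and student activations one layer at a time. The summary only states that layer-wise activations are transferred; the mean-squared-error objective, the freeze-earlier-layers schedule (in the spirit of MobileBERT, which Figure 2 is adapted from), and the names `layer_mse` and `progressive_schedule` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def layer_mse(teacher_hidden, student_hidden):
    """Mean squared error between two activation tensors of equal shape.

    Assumption: MSE is used as the per-layer matching objective.
    """
    t = np.asarray(teacher_hidden, dtype=np.float64)
    s = np.asarray(student_hidden, dtype=np.float64)
    if t.shape != s.shape:
        raise ValueError(f"shape mismatch: {t.shape} vs {s.shape}")
    return float(np.mean((t - s) ** 2))


def progressive_schedule(num_layers):
    """Yield (stage, trainable_layer, frozen_layers) tuples.

    Assumption: at stage k only student layer k is trained, while the
    already-matched layers 0..k-1 are kept frozen.
    """
    for stage in range(num_layers):
        yield stage, stage, list(range(stage))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical activations: 6 layers, batch 2, sequence 8, width 16.
    teacher = [rng.normal(size=(2, 8, 16)) for _ in range(6)]
    student = [h + 0.1 * rng.normal(size=h.shape) for h in teacher]
    for stage, layer, frozen in progressive_schedule(len(teacher)):
        loss = layer_mse(teacher[layer], student[layer])
        print(f"stage {stage}: train layer {layer}, frozen {frozen}, mse {loss:.4f}")
```

In a real training loop the per-layer loss would drive gradient updates of the student's Hyena block at that layer; here the student activations are simply perturbed copies so the script runs on its own.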

Abstract

The rapid evolution of Large Language Models (LLMs), epitomized by architectures like GPT-4, has reshaped the landscape of natural language processing. This paper introduces an approach to the efficiency concerns associated with LLM pre-training, proposing the use of knowledge distillation for cross-architecture transfer. Leveraging insights from the efficient Hyena mechanism, our method replaces attention heads in transformer models with Hyena, offering a cost-effective alternative to traditional pre-training while addressing the difficulty of processing long contexts that is inherent in quadratic attention mechanisms. Unlike conventional compression-focused methods, our technique not only enhances inference speed but also surpasses pre-training in terms of both accuracy and efficiency. In the era of evolving LLMs, our work contributes to the pursuit of sustainable AI solutions, striking a balance between computational power and environmental impact.
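To make the contrast with quadratic attention concrete, the sketch below computes a causal long convolution two ways: directly, in O(L^2) operations, and through the FFT, in O(L log L). It shows only the convolution primitive that long-convolution models build on; the full Hyena operator also involves gating and implicitly parametrized filters, which are omitted. The function names and the equal-length filter are assumptions made for illustration.

```python
import numpy as np


def causal_conv_direct(u, h):
    """Causal convolution y[t] = sum_{s<=t} h[s] * u[t-s], in O(L^2) time."""
    u = np.asarray(u, dtype=np.float64)
    h = np.asarray(h, dtype=np.float64)
    L = len(u)
    y = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            y[t] += h[s] * u[t - s]
    return y


def causal_conv_fft(u, h):
    """Same result via the FFT, in O(L log L) time.

    Assumes the filter h is as long as the input u. Zero-padding to 2L
    prevents circular wrap-around from leaking into the first L outputs.
    """
    u = np.asarray(u, dtype=np.float64)
    h = np.asarray(h, dtype=np.float64)
    L = len(u)
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(h, n=n), n=n)
    return y[:L]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.normal(size=64)
    h = rng.normal(size=64)
    print(np.allclose(causal_conv_direct(u, h), causal_conv_fft(u, h)))
```

The two functions agree up to floating-point error, so the FFT path can stand in for the direct sum whenever the sequence is long enough for the asymptotic saving to matter.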

Paper Structure (21 sections, 8 equations, 2 figures, 5 tables)

Introduction
Background
Self Attention Mechanism
Subquadratic Attention Replacements
Distillation
Progressive Knowledge Transfer
Methods
Hyena Operator
Model
Distillation Procedure
Training Dataset and Procedure
Language Modeling Results
Perplexity Scores
Language Evaluation
Discussion
...and 6 more sections

Figures (2)

Figure 1: (A) GPT-NeoX layer architecture: six stacked layers of attention and MLP blocks in the 70M GPT-NeoX model. (B) Hyena-distilled GPT-NeoX layer architecture: the attention heads are replaced by the Hyena operator for the distillation task. (C) A visual representation of the attention operator, adapted from Vaswani et al. (2017). (D) A visual representation of the Hyena operator, adapted from Poli et al. (2023).
Figure 2: Progressive knowledge transfer applied to the decoder layers of a Pythia model. Adapted from Sun et al. (2020).
