
Scavenging Hyena: Distilling Transformers into Long Convolution Models

Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang

TL;DR

The paper addresses the high cost and limited context length of conventional LLM pre-training by distilling a transformer into a long-convolution Hyena model, replacing attention with the subquadratic Hyena operator. It uses progressive knowledge transfer, in which a Hyena-based student is trained layer by layer to reproduce the activations of an attention-based teacher, and reports that this cross-architecture distillation reaches competitive perplexities and downstream task performance with substantially less training. Experiments on a 70M-parameter GPT-NeoX-style model show that Hyena students distilled from attention-based teachers can match or exceed pre-training results in some cases, while scaling better to long contexts. By combining efficient long-context modeling with cross-architecture knowledge transfer, the work suggests a practical route to cheaper and less energy-intensive LLM training.

Abstract

The rapid evolution of Large Language Models (LLMs), epitomized by architectures like GPT-4, has reshaped the landscape of natural language processing. This paper introduces a pioneering approach to address the efficiency concerns associated with LLM pre-training, proposing the use of knowledge distillation for cross-architecture transfer. Leveraging insights from the efficient Hyena mechanism, our method replaces attention heads in transformer models with Hyena, offering a cost-effective alternative to traditional pre-training while confronting the challenge of processing long contextual information, a challenge inherent in quadratic attention mechanisms. Unlike conventional compression-focused methods, our technique not only enhances inference speed but also surpasses pre-training in terms of both accuracy and efficiency. In the era of evolving LLMs, our work contributes to the pursuit of sustainable AI solutions, striking a balance between computational power and environmental impact.
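To make the contrast with quadratic attention concrete, the following is a minimal sketch of the long-convolution idea only. It assumes a plain causal convolution with an explicit filter `h`; the full Hyena operator also uses element-wise gating and implicitly parametrized filters, which are omitted here. The function names are illustrative and do not come from the paper.

```python
import numpy as np

def causal_conv_direct(u, h):
    """Reference O(L^2) causal convolution: y[t] = sum_{s<=t} h[t-s] * u[s]."""
    L = len(u)
    y = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            y[t] += h[t - s] * u[s]
    return y

def causal_conv_fft(u, h):
    """Same output in O(L log L) time using zero-padded FFTs."""
    L = len(u)
    n = 2 * L  # padding so the circular convolution does not wrap around
    y = np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(h, n=n), n=n)
    return y[:L]  # keep the first L outputs, i.e. the causal part

rng = np.random.default_rng(0)
u = rng.standard_normal(256)  # toy input sequence (assumption)
h = rng.standard_normal(256)  # toy filter as long as the sequence (assumption)
max_err = np.max(np.abs(causal_conv_direct(u, h) - causal_conv_fft(u, h)))
```

Both functions compute the same sequence, so `max_err` should sit at floating-point round-off level. The FFT route is what lets a filter as long as the input remain subquadratic in the sequence length.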

Paper Structure (21 sections, 8 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: (A) GPT-NeoX layer architecture: 6 layers of stacked attention and MLP blocks in the 70M GPT-NeoX model. (B) Hyena-distilled GPT-NeoX layer architecture: attention heads are replaced by the Hyena operator for the distillation task. (C) A visual representation of the attention operator, adapted from Vaswani et al. (2017). (D) A visual representation of the Hyena operator, adapted from Poli et al. (2023).
  • Figure 2: Progressive knowledge transfer applied to the decoder layers of a Pythia model. Adapted from Sun et al. (2020).
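
The layer-by-layer scheme that Figure 2 refers to can be illustrated with a toy sketch. It rests on several assumptions not confirmed by this summary: each layer is a single linear map standing in for an attention or Hyena block, the loss is mean squared error on activations, and earlier student layers stay frozen while a later one is trained. The paper's actual loss and schedule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers, n_samples = 8, 3, 64

# Hypothetical stand-ins: every "layer" is one linear map.
teacher = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
student = [np.zeros((d, d)) for _ in range(n_layers)]
x = rng.standard_normal((n_samples, d))

def forward(layers, inputs, upto):
    """Return the activation after the first `upto` layers."""
    h = inputs
    for W in layers[:upto]:
        h = h @ W
    return h

def layer_mse(k):
    """MSE between student and teacher activations after k layers."""
    diff = forward(student, x, k) - forward(teacher, x, k)
    return float(np.mean(diff ** 2))

losses_before = [layer_mse(k) for k in range(1, n_layers + 1)]

# Progressive transfer: fit layer k while layers before it stay frozen.
lr, steps = 0.02, 500  # illustrative hyperparameters (assumption)
for k in range(n_layers):
    h_in = forward(student, x, k)        # output of the frozen student prefix
    target = forward(teacher, x, k + 1)  # teacher activation at the same depth
    for _ in range(steps):
        err = h_in @ student[k] - target
        grad = 2.0 * h_in.T @ err / err.size  # gradient of the mean squared error
        student[k] -= lr * grad

losses_after = [layer_mse(k) for k in range(1, n_layers + 1)]
```

Comparing `losses_before` with `losses_after` shows the activation mismatch shrinking at every depth. In a real setup the linear maps would be replaced by the teacher's attention blocks and the student's Hyena blocks.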