InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer

Tony Zhang, Rickard Brännvall

TL;DR

InhibiDistilbert investigates ReLU and addition-based inhibitor attention as a lighter alternative to the matrix multiplication and softmax-based attention used in transformers, and trains it via knowledge distillation on DistilBERT. It introduces per-head learnable scalars $\gamma$, $\eta$, and $\delta$ to calibrate inhibition and explores two distillation regimes (task-agnostic and task-specific) for training the inhibitor transformer, aiming to preserve NLP performance while reducing compute. Empirical results show competitive GLUE performance with a modest average drop and comparable IMDB results, though task-specific knowledge distillation underperforms on several GLUE tasks. Energy-efficiency gains are not clearly demonstrated on conventional hardware, suggesting a need for specialized hardware or further optimization. The work outlines practical limitations and directions for hardware-aware deployment, quantization, and broader evaluations across architectures and modalities.

Abstract

This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.
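To make the contrast with scaled dot-product attention concrete, the following is a minimal sketch of one way Manhattan-distance scores and ReLU-based inhibition can be combined. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `gamma` is assumed to act as a scale on the distance and `delta` as a shift on the score, `eta` is omitted, and values are assumed non-negative. The exact formulation used in the paper is not given in this section.

```python
# Illustrative sketch only; names and the roles of gamma/delta are assumptions.

def relu(x):
    """Return x if positive, else 0.0."""
    return x if x > 0 else 0.0


def manhattan_scores(Q, K, gamma=1.0):
    """Z[i][j] = (1 / gamma) * sum_k |Q[i][k] - K[j][k]|.

    Replaces the query-key dot product with an L1 (Manhattan) distance:
    a large distance means the key is a poor match for the query.
    """
    return [
        [sum(abs(q - k) for q, k in zip(q_row, k_row)) / gamma for k_row in K]
        for q_row in Q
    ]


def inhibitor_attention(Q, K, V, gamma=1.0, delta=0.0):
    """H[i][k] = sum_j relu(V[j][k] - relu(Z[i][j] - delta)).

    Instead of weighting values by softmax probabilities, each value is
    reduced (inhibited) by its distance score and clipped at zero by ReLU.
    Assumes non-negative entries in V.
    """
    Z = manhattan_scores(Q, K, gamma)
    d_v = len(V[0])
    H = [[0.0] * d_v for _ in Q]
    for i in range(len(Q)):
        for j in range(len(K)):
            z = relu(Z[i][j] - delta)  # assumed shifted inhibition score
            for k in range(d_v):
                H[i][k] += relu(V[j][k] - z)
    return H


Q = [[0, 0]]
K = [[0, 0], [1, 1]]
V = [[1, 2], [3, 4]]
print(inhibitor_attention(Q, K, V))  # [[2.0, 4.0]]
```

In this toy example the first key matches the query exactly, so its value passes through unchanged, while the second key is at distance 2 and its value is reduced by 2 before being summed. Apart from the scaling by `gamma`, the inner loops use only subtraction, absolute value, comparison, and addition, which is the property the abstract points to as a potential source of savings.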

Paper Structure

This paper contains 2 sections, 3 equations, 4 tables.