How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena
Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
TL;DR
This work tackles the quadratic complexity of self-attention on long speech sequences by adapting the Hyena operator to speech models. The authors introduce ConfHyena (non-causal Hyena in all encoder layers) and Hybrid ConfHyena (non-causal Hyena in the early layers, followed by Conformer-style processing), both built on the Conformer architecture. On English automatic speech recognition (ASR) and speech translation (ST) from English into 8 target languages with MuST-C v1.0, Hybrid ConfHyena reduces training time by ~27% at the cost of ~1% quality degradation relative to Conformer; among the Hyena variants, the non-causal one yields the best quality. The results indicate that sub-quadratic, non-causal Hyena-based encoders can substantially reduce compute for long speech inputs without large losses in accuracy. Generalization to larger datasets and broader domains remains an important future step, alongside practical considerations such as hardware-dependent optimizations and the ethical implications of model efficiency.
Abstract
The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity. Consequently, research efforts in the last few years have focused on finding more efficient alternatives. Among them, Hyena (Poli et al., 2023) stands out for achieving competitive results in both language modeling and image classification, while offering sub-quadratic memory and computational complexity. Building on these promising results, we propose ConfHyena, a Conformer whose encoder self-attentions are replaced with an adaptation of Hyena for speech processing, where the long input sequences cause high computational costs. Through experiments in automatic speech recognition (for English) and translation (from English into 8 target languages), we show that our best ConfHyena model significantly reduces the training time by 27%, at the cost of minimal quality degradation (~1%), which, in most cases, is not statistically significant.
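To make the "sub-quadratic" claim concrete, the sketch below contrasts a directly computed long convolution, which costs O(L^2) for a sequence of length L, with the same convolution evaluated through the FFT in O(L log L). Long convolutions of this kind are the ingredient usually credited for Hyena's efficiency. This is an illustration only, not the authors' implementation: the function names, the use of NumPy, and in particular the centered-filter construction of the non-causal variant are assumptions made for this example, and the gating and learned filter parameterization of the actual operator are omitted.

```python
import numpy as np

def direct_causal_conv(x, h):
    """Causal convolution y[t] = sum_{s<=t} h[s] * x[t-s], computed directly.

    With len(h) == len(x) == L this takes O(L^2) multiply-adds.
    """
    L = len(x)
    y = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            y[t] += h[s] * x[t - s]
    return y

def fft_causal_conv(x, h):
    """The same causal convolution via the FFT, in O(L log L)."""
    L = len(x)
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(x, n=n) * np.fft.rfft(h, n=n), n=n)
    return y[:L]

def fft_noncausal_conv(x, h_centered):
    """Hypothetical non-causal variant (an assumption of this sketch).

    h_centered has length 2L-1 and its middle element is lag 0, so each
    output position can depend on both past and future inputs.
    """
    L = len(x)
    assert len(h_centered) == 2 * L - 1
    n = 3 * L  # at least L + (2L-1) - 1 points are needed
    full = np.fft.irfft(
        np.fft.rfft(x, n=n) * np.fft.rfft(h_centered, n=n), n=n
    )
    return full[L - 1 : 2 * L - 1]  # keep the L outputs aligned with x

# Small self-check on random data.
rng = np.random.default_rng(0)
L = 256
x = rng.standard_normal(L)
h = rng.standard_normal(L)
assert np.allclose(direct_causal_conv(x, h), fft_causal_conv(x, h))

h_nc = rng.standard_normal(2 * L - 1)
reference = np.convolve(x, h_nc, mode="full")[L - 1 : 2 * L - 1]
assert np.allclose(fft_noncausal_conv(x, h_nc), reference)
```

The gap between the two costs grows with L, which is why speech encoders, whose inputs are much longer than typical text sequences, are a natural target for this kind of replacement.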
