Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency
Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel
TL;DR
The paper studies the inference efficiency of self-supervised audio language models and asks whether complex speech-transformer encoders (Conformer/Squeezeformer) are needed to achieve it, or whether a simple self-attention encoder suffices. It evaluates HuBERT-based pre-training with three encoder candidates (Conformer, Squeezeformer, Sparseformer) and a Robustly Binarized Transformer (BiT) quantization approach, profiling compute and storage while testing on the SUPERB benchmark. The findings show that Conformer/Squeezeformer encoders can reduce resource usage, but that a purely self-attention encoder can achieve comparable efficiency, especially when combined with extreme 1-bit quantization, albeit with some performance degradation in ASR and other tasks. The results indicate that architecture shape and its interaction with quantization critically influence practical efficiency, which points design choices for on-device, self-supervised audio models toward simpler self-attention architectures under quantization constraints.
Abstract
In this paper, we show that a simple self-supervised pre-trained audio model can achieve inference efficiency comparable to that of more complicated pre-trained models with speech transformer encoders. These speech transformers mix convolutional modules with self-attention modules and achieve state-of-the-art performance on ASR with top efficiency. We first show that employing these speech transformers as the encoder also significantly improves the efficiency of pre-trained audio models. However, our study shows that comparable efficiency can be achieved with advanced self-attention alone. We demonstrate that this simpler approach is particularly beneficial when combined with low-bit weight quantization of the neural network to improve efficiency. We hypothesize that, unlike recent speech transformers that mix quantized convolution and quantized self-attention modules, it prevents errors from propagating between different quantized modules.
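
To make the notion of extreme low-bit weight quantization concrete, the following is a minimal sketch of 1-bit weight binarization. It assumes a simple sign-and-scale scheme in which every weight is replaced by +alpha or -alpha, with alpha set to the mean absolute weight (the scale that minimizes the squared error for a sign-based binarization). It is an illustration only, not the paper's quantization method; the function names `binarize` and `mean_squared_error` and the example weights are hypothetical.

```python
def binarize(weights):
    """Quantize weights to 1 bit: each value becomes +alpha or -alpha."""
    # Assumption: alpha is the mean absolute weight, the scale that
    # minimizes the squared error between w and alpha * sign(w).
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def mean_squared_error(original, quantized):
    """Average squared difference introduced by quantization."""
    return sum((o - q) ** 2 for o, q in zip(original, quantized)) / len(original)


# Hypothetical weights, chosen so the arithmetic is easy to follow.
weights = [0.5, -1.5, 1.0, -1.0]
quantized = binarize(weights)

print(quantized)                               # [1.0, -1.0, 1.0, -1.0]
print(mean_squared_error(weights, quantized))  # 0.125
```

Each module quantized this way contributes its own approximation error, which is the kind of per-module error the hypothesis about mixing quantized convolution and quantized self-attention refers to.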
