An Empirical Evaluation of Encoder Architectures for Fast Real-Time Long Conversational Understanding
Annamalai Senthilnathan, Kristjan Arumae, Mohammed Khalilia, Zhengzheng Xing, Aaron R. Colak
TL;DR
This work tackles real-time understanding of long conversations, where traditional fixed-length Transformers are costly because self-attention scales as $O(n^2)$ in the input length $n$. It compares efficient Transformer variants with a CNN-based two-tower Temporal Convolutional Network (TCN) encoder designed to handle arbitrary input lengths and diverse contextual ranges. Empirically, the CNN-based encoders deliver competitive accuracy on conversational tasks and remain competitive across multiple tasks of the Long Range Arena benchmark. They also substantially reduce training time, memory, and latency: on average, ~2.6x faster training, ~80% faster inference, and ~72% better memory efficiency than Transformers. Overall, CNN-based encoders emerge as a practical, scalable alternative for production systems that require fast inference on long transcripts.
Abstract
Analyzing long text data such as customer call transcripts is a cost-intensive and tedious task. Machine learning methods, namely Transformers, are leveraged to model agent-customer interactions. Unfortunately, Transformers adhere to fixed-length architectures, and their self-attention mechanism scales quadratically with input length. Such limitations make it challenging to leverage traditional Transformers for long-sequence tasks such as conversational understanding, especially in real-time use cases. In this paper, we explore and evaluate recently proposed efficient Transformer variants (e.g., Performer, Reformer) and a CNN-based architecture for real-time and near-real-time long conversational understanding tasks. We show that CNN-based models are dynamic (not tied to a fixed input length), ~2.6x faster to train, ~80% faster at inference, and ~72% more memory efficient than Transformers on average. Additionally, we evaluate the CNN model on the Long Range Arena benchmark to demonstrate its competitiveness in general long-document analysis.
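
To make the contrast between quadratic self-attention and a convolutional encoder concrete, the following is a minimal sketch, not the two-tower TCN evaluated in this paper. It assumes a single-channel input, causal dilated convolutions with left zero-padding, ReLU activations, and global max pooling; all function names (`dilated_conv1d`, `tcn_encode`, `receptive_field`, `conv_cost`, `attention_cost`) are hypothetical and chosen for illustration only. The cost functions count multiply-accumulate operations under these simplifying assumptions and ignore constants and feature dimensions.

```python
from typing import List


def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Causal dilated 1-D convolution; output has the same length as the input."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation  # look back j * dilation steps
            if idx >= 0:            # positions before the start act as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def tcn_encode(x: List[float], kernels: List[List[float]], dilations: List[int]) -> float:
    """Stack dilated convolutions, then pool over time to a fixed-size output.

    The same weights apply to any input length (assumption: one channel).
    """
    h = x
    for kernel, d in zip(kernels, dilations):
        h = relu(dilated_conv1d(h, kernel, d))
    return max(h) if h else 0.0  # global max pooling removes the length dimension


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input positions that can influence one output position."""
    return 1 + (kernel_size - 1) * sum(dilations)


def conv_cost(n: int, kernel_size: int, layers: int) -> int:
    """Multiply-accumulates for the stack above: linear in sequence length n."""
    return n * kernel_size * layers


def attention_cost(n: int) -> int:
    """Pairwise token interactions in full self-attention: quadratic in n."""
    return n * n


if __name__ == "__main__":
    kernels = [[0.5, 0.5, 0.5]] * 3
    dilations = [1, 2, 4]
    # The same encoder handles a short and a long sequence without truncation.
    print(tcn_encode([1.0] * 10, kernels, dilations))
    print(tcn_encode([1.0] * 5000, kernels, dilations))
    print(receptive_field(3, dilations))
    print(conv_cost(5000, 3, 3), attention_cost(5000))
```

In this sketch, doubling the dilation at each layer widens the context covered by each output while the per-layer cost stays proportional to the sequence length, which is one plausible reading of how a convolutional encoder can cover diverse contextual ranges without a fixed input length.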
