An Empirical Evaluation of Encoder Architectures for Fast Real-Time Long Conversational Understanding

Annamalai Senthilnathan, Kristjan Arumae, Mohammed Khalilia, Zhengzheng Xing, Aaron R. Colak

TL;DR

This work tackles real-time long-conversation understanding, where traditional fixed-length Transformers incur high costs due to $O(n^2)$ self-attention. It compares efficient Transformer variants with a CNN-based two-tower Temporal Convolutional Network (TCN) encoder designed to handle arbitrary input length and diverse contextual ranges. Empirical results show that CNN-based encoders deliver competitive accuracy on conversational tasks at substantially lower cost and latency, and that they remain competitive across multiple tasks of the Long Range Arena benchmark. Overall, CNN-based encoders emerge as a practical, scalable alternative for production systems that require fast inference on long transcripts, with significant reductions in training time, memory, and latency (on average, ~2.6x faster training, ~80% faster inference, and ~72% better memory efficiency).

Abstract

Analyzing long text data such as customer call transcripts is a cost-intensive and tedious task. Machine learning methods, namely Transformers, are leveraged to model agent-customer interactions. Unfortunately, Transformers adhere to fixed-length architectures, and their self-attention mechanism scales quadratically with input length. Such limitations make it challenging to leverage traditional Transformers for long-sequence tasks, such as conversational understanding, especially in real-time use cases. In this paper, we explore and evaluate recently proposed efficient Transformer variants (e.g., Performer, Reformer) and a CNN-based architecture for real-time and near-real-time long conversational understanding tasks. We show that, compared to Transformers, CNN-based models are dynamic in input length and, on average, ~2.6x faster to train, ~80% faster at inference, and ~72% more memory efficient. Additionally, we evaluate the CNN model on the Long Range Arena benchmark to demonstrate its competitiveness in general long-document analysis.
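
The quadratic scaling mentioned in the abstract can be illustrated with a rough operation count. The sketch below is not from the paper: the function names, the hidden size `d`, and the kernel size `k` are illustrative assumptions, and the counts cover only the dominant matrix products (projections, softmax, and constant factors are ignored).

```python
def attention_mult_adds(n: int, d: int) -> int:
    """Multiply-adds for the two n x n products in one self-attention head:
    scores = Q @ K.T (n*n*d) and output = weights @ V (n*n*d)."""
    return 2 * n * n * d


def conv1d_mult_adds(n: int, d: int, kernel_size: int) -> int:
    """Multiply-adds for one 1-D convolution layer with d input and d output
    channels: each of the n positions applies kernel_size * d * d weights."""
    return n * kernel_size * d * d


if __name__ == "__main__":
    d, k = 64, 3  # illustrative sizes, not taken from the paper
    for n in (512, 1024, 2048, 4096):
        print(n, attention_mult_adds(n, d), conv1d_mult_adds(n, d, k))
```

Under these assumptions, doubling the sequence length quadruples the attention count but only doubles the convolution count, which is the gap that grows on long transcripts.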

Paper Structure

This paper contains 28 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Our CNN model with TCN at its core. Here, TCN learns from bi-directional context, t is the current token and T is the max sequence length of the conversation, w is the word embedding (n-dim vector each), h is the latent representation (m-dim vector each) of the respective token, and L denotes the number of layers.
  • Figure 2: An excerpt from the original ABCD dataset, highlighting the action labels and their relevant entities
  • Figure 3: Repurposed ABCD dataset's conversation and utterance label distributions
  • Figure 4: Proprietary dataset's conversation and utterance label distributions
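
The Figure 1 caption describes a TCN that reads bi-directional context over L layers. A minimal single-channel sketch of that idea follows. It is a toy under stated assumptions, not the authors' model: the dilation schedule `2**layer`, the function names, and the absence of nonlinearities, residual connections, and the two-tower structure are all simplifications introduced here for illustration.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Centered (non-causal) dilated 1-D convolution with zero padding.

    The output has the same length as the input, so layers can be stacked
    on sequences of any length.
    """
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel size keeps the window centered"
    half = k // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - half) * dilation  # reaches both past and future tokens
            if 0 <= idx < len(x):  # positions outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def encode(x: Sequence[float], kernel: Sequence[float], num_layers: int) -> List[float]:
    """Stack num_layers dilated convolutions; layer l uses dilation 2**l (assumed)."""
    h = list(x)
    for layer in range(num_layers):
        h = dilated_conv1d(h, kernel, dilation=2 ** layer)
    return h


def receptive_field(kernel_size: int, num_layers: int) -> int:
    """Number of input tokens visible to one output position after all layers."""
    return 1 + (kernel_size - 1) * sum(2 ** l for l in range(num_layers))
```

Because no step depends on a maximum length, the same weights apply to a 7-token or a 7,000-token input, and with the assumed doubling schedule the context seen by each token grows exponentially with the number of layers.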