DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams

Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis

TL;DR

DeepCoT addresses the need for low-latency, real-time inference on streaming data by letting deep encoder stacks run under continual inference. Each layer uses Single Output attention with a key/value (KV) cache: attention is computed only for the newest token, and cached keys and values are reused without being updated, so each layer costs $O(n d)$ for a window of $n$ tokens. A theoretical framework compares base Transformer attention with DeepCoT, showing linear cost, an expanded receptive field of up to $l(n-1)$ past tokens for $l$ stacked layers, and the effect of using the SOFT activation while removing the FFN nonlinearity and LayerNorm. Empirically, DeepCoT achieves competitive accuracy across audio, video, and text domains while delivering speedups of up to about two orders of magnitude over prior efficient continual models.
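The caching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SingleOutputAttention` and the `step` method are hypothetical, the query, key, and value are assumed to be already projected, and a standard softmax is swapped in for the SOFT activation used in the paper. Each step appends one key/value pair to a fixed-size memory and computes attention for the newest token only, so the work per step grows linearly with the window size.

```python
import math
from collections import deque


class SingleOutputAttention:
    """Hypothetical single-head attention layer with a sliding KV memory."""

    def __init__(self, n):
        # The memory holds at most n key/value pairs; the oldest pair is
        # dropped automatically when a new one arrives.
        self.keys = deque(maxlen=n)
        self.values = deque(maxlen=n)

    def step(self, q, k, v):
        """Process one new token and return a single output vector."""
        # Cached entries are stored once and never recomputed.
        self.keys.append(k)
        self.values.append(v)

        # One query against at most n cached keys: O(n d) work.
        d = len(q)
        scores = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
            for key in self.keys
        ]

        # Numerically stable softmax (an assumption; the paper uses SOFT).
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)

        # Weighted sum of the cached values.
        return [
            sum(w * val[j] for w, val in zip(weights, self.values)) / total
            for j in range(len(v))
        ]


# Usage: feed a toy stream one token at a time (q = k = v for simplicity).
layer = SingleOutputAttention(n=3)
stream = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs = [layer.step(x, x, x) for x in stream]
```

After the fourth token the memory holds only the three most recent key/value pairs, which is the sliding-window behaviour the TL;DR describes.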

Abstract

Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. At the same time, there is a growing demand for low-latency, high-performance inference on resource-constrained devices. In particular, stream data inference is typically performed over a sliding temporal window, leading to highly redundant computations. The recent Continual Transformers have addressed this issue, but they can only be used effectively in shallow models, which limits their scope and generalization power. In this paper, we propose the Deep Continual Transformer (DeepCoT), a redundancy-free encoder-only model that can be applied over existing deep encoder architectures with minimal changes. In our experiments over audio, video, and text streams, we show that DeepCoTs retain performance comparable to their non-continual baselines while offering a linear computational cost for all Transformer layers, which reduces running time by up to two orders of magnitude compared to previous efficient models.

Paper Structure

This paper contains 27 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Average latency observed with different window sizes (batch size of 16). Latency of our DeepCoT models increases linearly with respect to the input window size ($n$), with a negligible cost increase as the window size grows. The details of this experiment can be found in Section \ref{sc:runtime}.
  • Figure 1: Average latency (seconds per token) observed with different window sizes.
  • Figure 2: Overview of the attention mechanism of a DeepCoT layer. The dashed lines indicate the temporal order in which tokens are shifted during Continual Inference.
  • Figure 2: Average throughput (tokens per second) observed with different window sizes.
  • Figure 3: Visualization of the tokens used to compute every output of a two-layer encoder architecture with $n=4$. Notice how stacking multiple encoder layers extends the effective temporal receptive field.
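
The receptive-field growth shown in Figure 3 can be checked with a small dependency count. This is a sketch under stated assumptions rather than code from the paper: the function name `past_tokens_seen` is hypothetical, each layer is assumed to attend over a memory of $n$ entries that includes the current token, and cached entries are assumed never to be recomputed. Under those assumptions, a stack of $l$ layers reaches back $l(n-1)$ past tokens.

```python
def past_tokens_seen(num_layers, window, t=0):
    """Count the past input tokens that influence the output at time t."""
    # Start from the output position and walk down through the layers.
    deps = {t}
    for _ in range(num_layers):
        # Each position depends on the `window` most recent positions of
        # the layer below, including itself.
        deps = {s - k for s in deps for k in range(window)}
    # Exclude the current token, leaving only past tokens.
    return len(deps) - 1


# Two layers with n = 4, as in Figure 3: 2 * (4 - 1) = 6 past tokens.
two_layer = past_tokens_seen(num_layers=2, window=4)
```

A single layer with the same window sees only 3 past tokens, so stacking a second layer doubles the temporal reach without enlarging any layer's memory.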