
TS3-Codec: Transformer-Based Simple Streaming Single Codec

Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li

TL;DR

This work develops TS3-Codec, a purely transformer-based neural audio codec designed for streaming with a single codebook, eliminating convolutional layers to reduce hyperparameter sensitivity and computation. It employs sliding-window left-context attention, a factorized VQ codebook, and GAN-based training to achieve high audio quality at low bitrates, outperforming convolutional baselines under similar compute. Across ~1000 and ~600 bps settings, TS3-Codec demonstrates competitive or superior intelligibility, perceptual quality, and naturalness while using substantially less computation and bitrate, highlighting the viability of transformer-only architectures for streaming NACs. The approach offers a simpler, more efficient path for integrating neural audio codecs with speech language models and real-time applications.
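
Because the codec uses a single codebook, its bitrate follows directly from two numbers: how many tokens are emitted per second and how many bits one codebook index carries. The sketch below shows that arithmetic. The function name and the frame rates and codebook size are hypothetical values chosen for illustration; they are not taken from the text above.

```python
import math


def single_codebook_bitrate(frame_rate_hz: float, codebook_size: int) -> float:
    """Bits per second when each frame emits one index from a single codebook."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token


# Hypothetical settings, for illustration only (not the paper's reported configurations).
bitrate_a = single_codebook_bitrate(frame_rate_hz=50, codebook_size=65536)
bitrate_b = single_codebook_bitrate(frame_rate_hz=40, codebook_size=65536)
print(bitrate_a, bitrate_b)
```

With one codebook, lowering the token rate or shrinking the codebook are the only ways to reduce the bitrate, which is why the two operating points are naturally described in bps.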

Abstract

Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models. While mainstream NAC models are predominantly convolution-based, the performance of NACs with a purely transformer-based, convolution-free architecture remains unexplored. This paper introduces TS3-Codec, a Transformer-Based Simple Streaming Single Codec. TS3-Codec consists only of a stack of transformer layers and a few linear layers, offering greater simplicity and expressiveness by fully eliminating convolution layers, which require careful hyperparameter tuning and heavy computation. In the streaming setup, TS3-Codec achieves comparable or superior performance to a codec with a state-of-the-art convolution-based architecture while requiring only 12% of the computation and 77% of the bitrate. Furthermore, it significantly outperforms the convolution-based codec when using similar computational resources.

Paper Structure

This paper contains 26 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The framework of the proposed TS3-Codec. The transformer layer employs a sliding window attention on the left context, enabling the model to function in a streaming manner with linear (rather than quadratic) complexity relative to the waveform length.
  • Figure 2: Comparison between BigCodec-S and TS3-Codec (Bitrate $\approx$ 1000 bps). To enhance visualization, the y-axes for WER and MCD are inverted, so that model points in the upper-left corner exhibit the best performance with the least computational cost. ngf is a factor related to the model size of BigCodec-S.
  • Figure 3: Comparison between BigCodec-S and TS3-Codec (Bitrate $\approx$ 600 bps). To enhance visualization, the y-axes for WER and MCD are inverted, so that model points in the upper-left corner exhibit the best performance with the least computational cost.
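
The sliding-window attention on the left context described in the Figure 1 caption can be illustrated with an attention mask. The following is a minimal sketch, assuming a window that counts the current frame plus the frames immediately before it; the function name and the sizes used are hypothetical and do not reflect the paper's settings.

```python
import numpy as np


def sliding_window_causal_mask(num_frames: int, window: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if query frame i may attend to key frame j.

    Frame i sees only itself and the (window - 1) frames before it,
    so no future frame is ever needed (streaming-friendly).
    """
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]  # i - j for every (query, key) pair
    return (offset >= 0) & (offset < window)


# Hypothetical sizes, for illustration only.
mask = sliding_window_causal_mask(num_frames=8, window=3)
attended_per_frame = mask.sum(axis=1)
print(mask.astype(int))
print(attended_per_frame)
```

Each row has at most `window` True entries, so the number of attended pairs is bounded by `num_frames * window`. That bound is what makes the cost grow linearly, rather than quadratically, with sequence length.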