On the use of Performer and Agent Attention for Spoken Language Identification

Jitendra Kumar Dhiman, Jainag Ambati

TL;DR

This study analyzes three attention mechanisms (Self-, Performer-, and Agent-attention) within an attentive statistical pooling framework for Spoken Language Identification (LID) using BEST-RQ frame embeddings. Across VoxPopuli, VoxLingua, and FLEURS, Performer-attention with $r=128$ consistently yields the best accuracy and F1, outperforming self-attention, while Agent-attention gives competitive results with lower memory usage and retains linear-time complexity. The results indicate that linear-time attention can replace quadratic-cost self-attention in LID pipelines without sacrificing accuracy, which suggests practical benefits for real-time applications. The work also points to possible extensions to speaker identification and to the need to evaluate inference-time efficiency for deployment scenarios.

Abstract

One approach to language identification (LID) derives speech representations from models pre-trained with self-supervised learning and then fine-tunes the model for the LID task. State-of-the-art LID approaches use an attention-based statistical pooling layer to aggregate contextual information across the time frames of the embedding vectors extracted from the pre-trained model. In this paper, we explore two recently proposed attention mechanisms, Performer-attention and agent-attention, in conjunction with the statistical pooling layer. The LID experiments are performed on three datasets (VoxPopuli, FLEURS, and VoxLingua), and performance is compared against vanilla self-attention. Our findings suggest that Performer-attention outperforms self-attention, and that agent-attention performs comparably to, and occasionally better than, self-attention while being computationally less expensive.
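
To make the idea of linear-time attention followed by statistical pooling concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the use of i.i.d. Gaussian random features (rather than orthogonal ones), the absence of learned query/key/value projections, and the choice to pool the attended frames with a plain mean and standard deviation are all assumptions made for illustration. The parameter `r` is assumed here to be the number of random features.

```python
import math
import random


def random_features(x, omegas):
    """Positive random features phi(x), so that E[phi(q) . phi(k)] = exp(q . k)."""
    sq_norm = sum(v * v for v in x)
    r = len(omegas)
    return [
        math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - sq_norm / 2.0) / math.sqrt(r)
        for w in omegas
    ]


def performer_attention(Q, K, V, r=128, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V without forming the T x T matrix."""
    d = len(Q[0])
    d_v = len(V[0])
    rng = random.Random(seed)
    # Assumption: i.i.d. Gaussian projections (no orthogonalisation).
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(r)]
    scale = d ** -0.25  # splits the 1/sqrt(d) scaling between queries and keys
    phi_q = [random_features([scale * v for v in q], omegas) for q in Q]
    phi_k = [random_features([scale * v for v in k], omegas) for k in K]

    # Key summaries, computed once: S = phi(K)^T V (r x d_v), z = phi(K)^T 1 (r).
    S = [[0.0] * d_v for _ in range(r)]
    z = [0.0] * r
    for pk, v in zip(phi_k, V):
        for j in range(r):
            z[j] += pk[j]
            for c in range(d_v):
                S[j][c] += pk[j] * v[c]

    # Each query only touches the r-dimensional summaries, so cost is linear in T.
    out = []
    for pq in phi_q:
        denom = sum(pq[j] * z[j] for j in range(r))
        out.append([sum(pq[j] * S[j][c] for j in range(r)) / denom for c in range(d_v)])
    return out


def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation over time."""
    T = len(frames)
    d = len(frames[0])
    mean = [sum(f[c] for f in frames) / T for c in range(d)]
    var = [sum((f[c] - mean[c]) ** 2 for f in frames) / T for c in range(d)]
    return mean + [math.sqrt(v) for v in var]
```

With frame embeddings `X` of shape T x d (standing in for the BEST-RQ outputs), `statistics_pooling(performer_attention(X, X, X, r=128))` returns one fixed-length vector of size 2d per utterance, which a classifier could then map to language labels. The quadratic self-attention baseline would instead compute all T x T query-key scores explicitly.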

Paper Structure

This paper contains 14 sections, 2 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Block diagram of the LID classifier. The BEST-RQ block is pre-trained using a self-supervised learning framework, followed by fine-tuning of the LID classifier.