Evaluating Self-Supervised Speech Models via Text-Based LLMs

Takashi Maekaku, Keita Goto, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe

TL;DR

The paper tackles the challenge of evaluating self-supervised speech models without task-specific training by introducing a text-based large language model metric. It discretizes SSL representations into token sequences, then uses in-context prompts with an LLM to compute the mean log-likelihood (MLL) of the target sequences, providing a label-free proxy for downstream performance. Experiments show that MLL correlates with ASR performance across multiple SSL encoders and LLMs, and that LLM-derived embeddings can improve speaker verification. The approach is robust to prompt templates and benefits from longer input contexts, offering a scalable, training-free tool for SSL model analysis with practical implications for ASR and SV tasks.

Abstract

Self-supervised learning (SSL) has gained traction for its ability to learn rich representations with low labeling costs, applicable across diverse downstream tasks. However, assessing downstream-task performance remains challenging due to the cost of extra training and evaluation. Existing methods for task-agnostic evaluation also require extra training or hyperparameter tuning. We propose a novel evaluation metric using large language models (LLMs). By inputting discrete token sequences and minimal domain cues derived from SSL models into LLMs, we obtain the mean log-likelihood (MLL); these cues guide in-context learning, rendering the score more reliable without extra training or hyperparameter tuning. Experimental results show a correlation between LLM-based scores and performance on the automatic speech recognition (ASR) task. Additionally, our findings reveal that LLMs not only function as SSL evaluation tools but also provide inference-time embeddings that are useful for the speaker verification (SV) task.
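
As a rough illustration of the score described above, the sketch below computes a mean log-likelihood from per-step LLM logits. It assumes the common definition MLL = (1/T) * sum_t log p(x_t | x_<t, prompt). The paper's exact formula, prompt format, and unit-to-text mapping are not given in this summary, so the helper names (units_to_text, mean_log_likelihood) and the toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def units_to_text(unit_ids: Sequence[int]) -> str:
    """Hypothetical mapping: render discrete SSL unit IDs as a space-separated string."""
    return " ".join(str(u) for u in unit_ids)


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_log_likelihood(step_logits: Sequence[Sequence[float]],
                        target_ids: Sequence[int]) -> float:
    """Average log-probability the model assigns to each target token.

    step_logits[t] is assumed to be the logit vector the LLM produced
    for position t, conditioned on the prompt and on targets before t.
    """
    if len(step_logits) != len(target_ids) or not target_ids:
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, token_id in zip(step_logits, target_ids):
        total += log_softmax(logits)[token_id]
    return total / len(target_ids)


# Toy example: a 4-token vocabulary and a model with no preference,
# so every target token gets probability 1/4.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
toy_targets = [2, 0]
mll = mean_log_likelihood(toy_logits, toy_targets)
```

In practice the logits would come from a forward pass of the chosen LLM over the prompt plus the target sequence. A higher (less negative) MLL means the LLM finds the unit sequence more predictable given its in-context examples.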

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Relationship between context size (number of preceding and succeeding utterances) and the MLL.
  • Figure 2: Layer-wise MLL comparison for the three SSL models. MLLs are computed from the outputs of three LLMs (Gemma3-4b, Qwen3-4b, Phi-4-mini). The left y-axis reports scores obtained with Phi-4-mini, whereas the right y-axis reports scores obtained with sub-models Gemma3-4b and Qwen3-4b.
  • Figure 3: Equal error rate (EER) comparison across layers of the three SSL models. The final-layer hidden representations from Gemma3-4b are used to perform the speaker verification (SV) task, and the resulting EER is recorded.
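
Figure 3 reports the equal error rate, the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below shows one simple way to estimate it from trial scores by sweeping thresholds. It assumes that higher scores mean "same speaker" and that scores (for example, cosine similarities between embeddings) are already available. How the paper scores its trials is not stated in this summary, so the function name and the toy scores are hypothetical.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Estimate EER by sweeping every observed score as a decision threshold.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance: a different-speaker trial scored at or above threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a same-speaker trial scored below threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy trials: perfectly separated scores versus overlapping scores.
eer_separated = equal_error_rate([0.9, 0.8], [0.1, 0.2])
eer_overlap = equal_error_rate([0.9, 0.4], [0.6, 0.1])
```

With few trials this threshold sweep is coarse. Evaluation toolkits usually interpolate along the ROC curve instead, which gives a smoother estimate.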