Effective Context in Neural Speech Models

Yen Meng, Sharon Goldwater, Hao Tang

TL;DR

This work proposes two approaches to measuring effective context, uses them to analyze different speech Transformers, and shows that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.

Abstract

Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that effective context increases mainly in the early layers, and remains relatively short -- similar to the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.

Paper Structure

This paper contains 9 sections, 4 equations, 5 figures.

Figures (5)

  • Figure 1: Examples illustrating influence. Left: The x-axis is the time point $\tau$ of the input utterance $x$, and the y-axis is the calculated influence value $s(t, \tau)$ for the time point $t=300$ (red dot). Right: The relative influence $S(\sigma)$ (normalized), where the x-axis is the time shift relative to the center frame.
  • Figure 2: (a) Relative influence in the final layers of supervised 6-layer Transformer models trained for different tasks. The y-axis is on a log scale and the x-axis is only shown between $\pm 0.7$ s, although the relative influence values were computed with a window size of 5 seconds on both sides. Dots show the heights of the center peaks. (b) As in (a) but for different layers of HuBERT. (c) Contextualization of different models and layers. The horizontal lines are values of the supervised models on the final layer.
  • Figure 3: The change of output in terms of the $\ell_2$ distance (left) and the phone error rates (right), as we vary the window size of input to HuBERT (different coloured lines).
  • Figure 4: Relation between contextualization and probing performance on phones (left) and words (right). Each point represents a layer of a model. The last layer of the supervised 6-layer Transformer is annotated.
  • Figure 5: Results of different streaming settings. The dashed line represents the probing error rate with full context. We vary the amount of lookahead with unlimited history (right), and vary the amount of history with different amounts of lookahead (left).
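
The kind of measurement described in the Figure 3 caption (how much the output changes, in $\ell_2$ distance, as the input window shrinks) can be illustrated with a small sketch. This is not the authors' code and makes no claim about their exact procedure: `toy_model` is a hypothetical stand-in for a speech model (a fixed local weighted average, not HuBERT), and `output_change`, `half_window`, and the kernel values are all assumptions chosen for illustration.

```python
import numpy as np


def toy_model(x, kernel):
    """Hypothetical stand-in for a speech model (NOT HuBERT).

    x: (T, D) array of input frames; kernel: odd-length 1-D weights.
    Each output frame is a weighted sum of the input frames around it,
    with zero padding at the utterance edges.
    """
    radius = len(kernel) // 2
    padded = np.pad(x, ((radius, radius), (0, 0)))
    return np.stack(
        [kernel @ padded[i:i + len(kernel)] for i in range(len(x))]
    )


def output_change(x, t, half_window, kernel):
    """l2 distance between the full-context output at frame t and the
    output when the model sees only frames t-half_window..t+half_window."""
    full = toy_model(x, kernel)[t]
    lo = max(0, t - half_window)
    hi = min(len(x), t + half_window + 1)
    windowed = toy_model(x[lo:hi], kernel)[t - lo]
    return float(np.linalg.norm(full - windowed))


# Illustrative random "utterance": 50 frames of 4-dimensional features.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))

# Assumed kernel with a radius of 3 frames on each side.
kernel = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]) / 16.0

# Output change at frame 25 for increasing window sizes.
changes = {
    w: output_change(x, t=25, half_window=w, kernel=kernel)
    for w in range(6)
}
```

In this toy setting the change drops to zero (up to floating point) once the window covers the kernel's radius of 3 frames, because the stand-in model cannot see any further. For a real model the context is not known in advance, so the corresponding curve has to be measured rather than derived.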