The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning

Nik Vaessen, David A. van Leeuwen

TL;DR

This study systematically examines how batch size during pre-training affects contrastive self-supervised speech representations using wav2vec 2.0. By pre-training with batch sizes ranging from 87.5 seconds to 80 minutes of speech on LibriSpeech and evaluating ASR fine-tuning, it shows that, for a fixed number of iterations, larger batch sizes yield better pre-trained models, with a lower limit below which training is unstable and an upper limit beyond which returns diminish. A central finding is that downstream performance is driven primarily by the total amount of speech data seen during self-supervision, i.e., the product of batch size and number of iterations, rather than by batch size alone. The work provides an independent implementation, public checkpoints, and concrete guidance for resource-constrained research on SSL in speech, and argues for benchmarking with a fixed amount of data seen. Overall, the results suggest practical operating ranges for studying SSL under academic compute budgets.

Abstract

Foundation models in speech are often trained using many GPUs, which implicitly leads to large effective batch sizes. In this paper we study the effect of batch size on pre-training, both in terms of statistics that can be monitored during training, and in the effect on the performance of a downstream fine-tuning task. By using batch sizes varying from 87.5 seconds to 80 minutes of speech we show that, for a fixed number of iterations, larger batch sizes result in better pre-trained models. However, there is a lower limit for stability, and an upper limit for effectiveness. We then show that the quality of the pre-trained model depends mainly on the amount of speech data seen during training, i.e., on the product of batch size and number of iterations. All results are produced with an independent implementation of the wav2vec 2.0 architecture, which to a large extent reproduces the results of the original work (arXiv:2006.11477). Our extensions can help researchers choose effective operating conditions when studying self-supervised learning in speech, and hint towards benchmarking self-supervision with a fixed amount of seen data. Code and model checkpoints are available at https://github.com/nikvaessen/w2v2-batch-size.
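
The "product of batch size and number of iterations" can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the paper's repository: the two batch sizes are the extremes quoted in the abstract, while the function names, the iteration count, and the 1000-hour budget are hypothetical values chosen only for illustration.

```python
import math


def hours_seen(batch_seconds: float, iterations: int) -> float:
    """Hours of speech processed: batch size (in seconds) times iterations."""
    return batch_seconds * iterations / 3600.0


def iterations_for_budget(batch_seconds: float, budget_hours: float) -> int:
    """Iterations needed (rounded up) to process a fixed budget of speech."""
    return math.ceil(budget_hours * 3600.0 / batch_seconds)


# Batch-size extremes quoted in the abstract.
SMALLEST_BATCH_S = 87.5      # 87.5 seconds of speech per batch
LARGEST_BATCH_S = 80 * 60    # 80 minutes of speech per batch

# Hypothetical values, NOT taken from the paper.
EXAMPLE_ITERATIONS = 100_000
EXAMPLE_BUDGET_HOURS = 1000.0

for batch_s in (SMALLEST_BATCH_S, LARGEST_BATCH_S):
    seen = hours_seen(batch_s, EXAMPLE_ITERATIONS)
    needed = iterations_for_budget(batch_s, EXAMPLE_BUDGET_HOURS)
    print(f"batch={batch_s:7.1f}s  hours seen={seen:10.1f}  "
          f"iterations for budget={needed}")
```

Under these assumptions, the same number of iterations exposes the model to roughly 55 times more speech at the largest batch size than at the smallest, which is why comparisons at a fixed iteration count and comparisons at a fixed amount of data seen can lead to different conclusions.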

Paper Structure (27 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: A schematic overview of the wav2vec 2.0 framework during self-supervision. Dashed arrows indicate a projection using a linear layer without activation to match a target dimension.
  • Figure 2: Various metrics on validation data (interval of 5k training steps) during self-supervised pre-training with different batch sizes, namely all three losses (A, B, C), the accuracy of predicting the correct masked quantized vector (D), and the perplexity of codebook 1 (E) and codebook 2 (F). We also show the average, minimum, and maximum value of the cosine similarity between codewords of codebook 1 (G, H, I) and codebook 2 (J, K, L), with an interval of 100 training steps.
  • Figure 3: The WER (left column: LibriSpeech test-clean, right column: LibriSpeech test-other) against the batch size during pre-training of a self-supervised initialization. The self-supervised models are fine-tuned for speech recognition using 5 different magnitudes of labeled data. Scratch indicates fine-tuning a random initialization instead of a self-supervised initialization. The upper row shows the WER with letter decoding, while the bottom row shows the WER with word decoding using a 4-gram language model.
  • Figure 4: The standard deviation of the gradient, averaged over all parameters, against consecutive checkpoints taken every 5k steps during pre-training. The gradients are calculated with 10 random batches, and the size of each batch matches the size used when training the checkpoint. No update steps are applied during these measurements.
  • Figure 5: We plot the WER after fine-tuning against the hours of data processed during self-supervision (upper bound) for different batch sizes. The left column shows WER on LibriSpeech test-clean with 10 minutes of labeled data fine-tuning, the right column with 100 hours of labeled data fine-tuning.
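
The codebook statistics named in the Figure 2 caption (perplexity, and the average, minimum, and maximum cosine similarity between codewords) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: perplexity is assumed here to be the exponentiated entropy of the empirical codeword usage distribution, the cosine statistics are assumed to be taken over all distinct codeword pairs, and all function names are hypothetical.

```python
import math
from itertools import combinations


def codebook_perplexity(usage_counts):
    """Assumed definition: exp(entropy) of the codeword usage distribution.

    Equals the codebook size when all codewords are used equally often,
    and 1.0 when a single codeword receives all the usage.
    """
    total = sum(usage_counts)
    probs = [count / total for count in usage_counts if count > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine_stats(codewords):
    """Average, minimum, and maximum cosine similarity over distinct pairs."""
    sims = [cosine_similarity(u, v) for u, v in combinations(codewords, 2)]
    return sum(sims) / len(sims), min(sims), max(sims)


# Toy data for illustration only.
print(codebook_perplexity([5, 5, 5, 5]))   # uniform usage of 4 codewords
print(codebook_perplexity([10, 0, 0, 0]))  # collapsed onto one codeword
print(pairwise_cosine_stats([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Read this way, a perplexity far below the codebook size would indicate that only a few codewords are in use, and an average cosine similarity close to 1 would indicate that the codewords have become nearly identical.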