Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

Hung-Chieh Fang; Nai-Xuan Ye; Yi-Jen Shih; Puyuan Peng; Hsuan-Fu Wang; Layne Berry; Hung-yi Lee; David Harwath

TL;DR

This work tackles the semantic gap in frame-level self-supervised speech models by introducing PW-HuBERT, which injects pseudo word-level targets derived from a visually-grounded speech model (VG-HuBERT) into HuBERT pretraining without requiring speech-text data. It presents two architectures, Single PW-HuBERT and Hierarchical PW-HuBERT. In both, word-level targets are generated from unsupervised word boundaries: the features within each segment are pooled, clustered, and aligned to the input sequence. The hierarchical variant additionally keeps a frame-level objective and trains it jointly with the word-level one. Across SLU benchmarks (SLUE, SLUE Phase-2, SNIPS) and semantic tasks (ZeroSpeech 2021 semantics), PW-HuBERT variants consistently improve semantic understanding, with the hierarchical model often delivering the strongest results; oracle boundaries offer limited gains, suggesting attention-derived boundaries are more informative. The study also shows that combining frame-level and word-level signals and freezing HuBERT weights can yield efficient, robust improvements, highlighting a practical path to richer semantic representations in speech SSL without labeled data.
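The target-generation step summarized above (pool within each word segment, assign a cluster, repeat the cluster ID over the segment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pseudo_word_targets`, the `(start, end)` boundary format, the use of precomputed K-means centroids, and the `-1` label for frames outside any segment are all hypothetical choices made for the example.

```python
import numpy as np

def pseudo_word_targets(frames, boundaries, centroids):
    """Map frame features to frame-aligned pseudo word-level targets.

    frames:     (T, D) array of frame-level features.
    boundaries: list of (start, end) frame indices, end exclusive (assumed format).
    centroids:  (K, D) array of centroids from an already-fitted K-means model.
    Returns a (T,) integer array of cluster IDs.
    """
    # Assumption: frames not covered by any segment are marked -1.
    targets = np.full(len(frames), -1, dtype=np.int64)
    for start, end in boundaries:
        # Mean-pool the features inside one word segment.
        pooled = frames[start:end].mean(axis=0)
        # K-means prediction is nearest-centroid assignment.
        dists = np.linalg.norm(centroids - pooled, axis=1)
        # Duplicate the cluster ID so it matches the segment length.
        targets[start:end] = int(np.argmin(dists))
    return targets

# Toy input: five 2-dimensional frames, two segments, two centroids.
frames = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 4.8], [9.0, 9.0]])
boundaries = [(0, 2), (2, 4)]
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
targets = pseudo_word_targets(frames, boundaries, centroids)
```

In this toy input the first two frames fall in one segment and the next two in another, so each pair shares one cluster ID, while the last frame lies outside every segment and keeps the placeholder label.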

Abstract

Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models are predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in real-world settings. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-grounded speech model, notably eliminating the need for speech-text paired data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.

Paper Structure (18 sections, 2 equations, 1 figure, 4 tables)

Introduction
Method
Preliminaries
Word-Boundary and Pseudo Target Generation
Single PW-HuBERT
Hierarchical PW-HuBERT
Experiment
Datasets
Baselines
Implementation Details
Results and Analysis
Main Results
Comparison with Oracle Setting
Ablation Studies
The Effect of Frame-level Targets
...and 3 more sections

Figures (1)

Figure 1: Overview of pseudo word-level target generation and PW-HuBERT models. (a) With the word segmentations from VG-HuBERT [peng2022word], we apply mean-pooling on the representations within the same segment. A K-means model is used to predict the cluster ID of each segment, which is then duplicated to match the length of the corresponding segment. (b) The HuBERT weights are frozen, and the features from each layer are passed to a learnable weighted sum layer to get the final representation. The representation is then passed to another two transformer layers to predict pseudo word-level targets. (c) Hierarchical PW-HuBERT predicts frame-level targets at the 12th layer and predicts pseudo word-level targets at the 14th layer.
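The learnable weighted sum in panel (b) can be illustrated with a short sketch. The caption does not say how the layer weights are normalized; the softmax over one scalar per layer used here is an assumption borrowed from common practice, and the function name `weighted_layer_sum` is hypothetical. Plain NumPy is used so that only the forward computation is shown, without any training loop.

```python
import numpy as np

def weighted_layer_sum(layer_feats, logits):
    """Combine per-layer features into one representation.

    layer_feats: (L, T, D) array, one (T, D) feature matrix per frozen layer.
    logits:      (L,) array of scalars, the only parameters that would be learned here.
    Returns a (T, D) array.
    """
    # Assumption: weights are softmax-normalized so they are positive and sum to 1.
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    # Contract the layer axis: sum over l of w[l] * layer_feats[l].
    return np.tensordot(w, layer_feats, axes=1)

# Toy input: three layers whose features are all 0, all 1, and all 2.
layer_feats = np.stack([np.zeros((3, 2)), np.ones((3, 2)), 2 * np.ones((3, 2))])
logits = np.zeros(3)  # equal logits give equal weights
combined = weighted_layer_sum(layer_feats, logits)
```

With equal logits the result is the plain average of the layers. During training the logits would shift weight toward the layers most useful for predicting the pseudo word-level targets, while the HuBERT layers themselves stay frozen.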
