Optimizing Speech Multi-View Feature Fusion through Conditional Computation

Weiqiao Shan, Yuhao Zhang, Yuchen Han, Bei Li, Xiaofeng Zhao, Yuang Li, Min Zhang, Hao Yang, Tong Xiao, Jingbo Zhu

TL;DR

This work proposes a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy that mitigates feature conflicts and bolsters model robustness to multi-view input features.

Abstract

Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.

Paper Structure (15 sections, 13 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) BLEU scores of models using only FBanks or only S-features. (b) When we freeze all model parameters during training and feed the two features in sequentially, up to 32% of the gradients generated by the two features contain conflicting components, and the angle between the two gradients increases as training progresses.
  • Figure 2: ST model architecture.
  • Figure 3: The $\cos(\theta)$ between the two gradients generated by each feature. Left (instability): among gradients with $\cos(\theta) > 0$, $\cos(\theta)$ shrinks steadily as training progresses. Right (conflict): the percentage of conflicting gradients ($\cos(\theta) < 0$) among all gradients in an input batch.
  • Figure 4: The mean value of $\mathbf{g}_{\text{fbank}}$ in the model trained from scratch (left) and in the pre-trained model (right).
  • Figure 5: The final result of our method.
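
The conflict statistic described in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. It assumes that each sample yields two flattened gradient vectors, one per input feature, and that a pair counts as conflicting when $\cos(\theta) < 0$. The function names `cosine` and `conflict_rate` are hypothetical, and this is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(g_a: Vector, g_b: Vector) -> float:
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero gradient is treated as neither aligned nor conflicting.
        return 0.0
    return dot / (norm_a * norm_b)


def conflict_rate(pairs: List[Tuple[Vector, Vector]]) -> float:
    """Fraction of gradient pairs in a batch whose cosine is negative."""
    if not pairs:
        raise ValueError("need at least one gradient pair")
    cosines = [cosine(g_a, g_b) for g_a, g_b in pairs]
    return sum(1 for c in cosines if c < 0.0) / len(cosines)


# Toy batch: (gradient from spectral feature, gradient from SSL feature).
# The values are made up for illustration only.
batch = [
    ([1.0, 0.0], [1.0, 0.0]),    # aligned
    ([1.0, 0.0], [-1.0, 0.0]),   # opposed
    ([1.0, 0.0], [0.0, 1.0]),    # orthogonal
    ([1.0, 1.0], [-1.0, 0.0]),   # partially opposed
]
rate = conflict_rate(batch)
```

In a real training setup the two gradients would come from back-propagating the same loss through the frozen model once per feature, as the Figure 1 caption describes; here they are plain lists so the sketch stays dependency-free.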