Optimizing Speech Multi-View Feature Fusion through Conditional Computation

Weiqiao Shan; Yuhao Zhang; Yuchen Han; Bei Li; Xiaofeng Zhao; Yuang Li; Min Zhang; Hao Yang; Tong Xiao; Jingbo Zhu

TL;DR

This work proposes a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy that mitigates feature conflicts and bolsters model robustness to multi-view input features.

Abstract

Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features such as FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.

Paper Structure (15 sections, 13 equations, 5 figures, 4 tables)

Introduction
Method
Architecture
Gradient-sensitive Gating Network for Conflicting Gradient
Multi-stage Dropout for Instability Gradient
Experiments
Datasets
Experimental Settings
Results
Analysis
Effectiveness of S-feature
Effectiveness of GSGN
Effectiveness under the pre-trained model
Weight Analysis for GSGN
Conclusion and Future Work

Figures (5)

Figure 1: (a) BLEU scores of the model using only FBanks or only the S-feature. (b) When we freeze all model parameters during training and feed in the two features sequentially, up to 32% of the gradients generated by the two features contain conflicting components, and the angle between the two gradients increases as training progresses.
Figure 2: ST model architecture.
Figure 3: The $\cos(\theta)$ between the two gradients generated by each feature. Instability (left): for gradients with $\cos(\theta) > 0$, $\cos(\theta)$ decreases steadily as training proceeds. Conflict (right): the percentage of conflicting gradients ($\cos(\theta) < 0$) among all gradients in an input batch.
Figure 4: The mean value of $\mathbf{g}_{\text{fbank}}$ in the model trained from scratch (left) and in the pre-trained model (right).
Figure 5: The final result of our method.
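
The two quantities plotted in Figure 3 can be illustrated with a minimal sketch. It assumes each gradient has already been flattened into a plain list of floats; the function names `cosine` and `conflict_stats` and the toy gradient pairs are hypothetical and are not taken from the paper, whose exact measurement procedure is not given in this summary.

```python
import math
from typing import List, Sequence, Tuple

Gradient = Sequence[float]


def cosine(u: Gradient, v: Gradient) -> float:
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption made for this sketch: a zero gradient counts as orthogonal.
        return 0.0
    return dot / (norm_u * norm_v)


def conflict_stats(pairs: List[Tuple[Gradient, Gradient]]) -> Tuple[float, float]:
    """Return (percent of pairs with cos < 0, mean cos over pairs with cos > 0)."""
    cosines = [cosine(g_a, g_b) for g_a, g_b in pairs]
    conflicting = [c for c in cosines if c < 0.0]
    agreeing = [c for c in cosines if c > 0.0]
    percent = 100.0 * len(conflicting) / len(cosines) if cosines else 0.0
    mean_pos = sum(agreeing) / len(agreeing) if agreeing else 0.0
    return percent, mean_pos


# Toy (hypothetical) gradient pairs: first entry stands in for the FBank
# gradient, second for the S-feature gradient of the same parameters.
pairs = [
    ([1.0, 0.0], [1.0, 1.0]),    # same general direction
    ([1.0, 0.0], [-1.0, 0.0]),   # opposite direction: conflicting
    ([0.0, 2.0], [0.0, 3.0]),    # parallel
    ([1.0, 1.0], [-1.0, 0.5]),   # negative dot product: conflicting
]
percent, mean_pos = conflict_stats(pairs)
print(f"conflicting: {percent:.1f}%  mean cos (cos > 0): {mean_pos:.3f}")
```

Tracking `percent` over training steps would correspond to the right panel, and `mean_pos` to the left panel, under the assumptions stated above.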
