Cold Fusion: Training Seq2Seq Models Together with Language Models
Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, Adam Coates
TL;DR
The paper addresses domain adaptation in Seq2Seq models by integrating a fixed pre-trained language model (LM) during training rather than only at inference. It introduces Cold Fusion, a fusion mechanism that projects the LM's logits into a common hidden space and applies fine-grained gating, so that the Seq2Seq decoder can leave language modeling to the LM and focus on task-specific decoding. Empirical results on automatic speech recognition (ASR) show faster convergence, better generalization, and a substantially smaller domain-transfer gap when labeled data is limited, outperforming Deep Fusion. The approach makes it possible to use abundant unlabeled text to improve Seq2Seq performance in new domains.
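The fusion step described above can be illustrated with a minimal sketch of a single decoder time step. It is not the authors' implementation: the layer sizes, the single-layer ReLU projections, the random initialization, and all names (`cold_fusion_step`, `params`, `affine`, and so on) are assumptions made for illustration, and "fine-grained" is taken here to mean one gate value per hidden unit. The LM is treated as fixed, so its logits enter only as an input.

```python
import math
import random


def affine(layer, x):
    """Compute W @ x + b, where layer = (W, b) and W is a list of rows."""
    W, b = layer
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero bias (illustrative initialization only)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def cold_fusion_step(s_t, lm_logits, params):
    """Fuse the decoder state s_t with the logits of a fixed LM for one step."""
    # 1. Project the LM logits into a hidden space shared with the decoder.
    h_lm = relu(affine(params["proj"], lm_logits))
    # 2. Gate computed from the concatenation [s_t; h_lm]
    #    (assumption: one sigmoid value per unit of h_lm).
    gate = sigmoid(affine(params["gate"], s_t + h_lm))
    # 3. Concatenate the decoder state with the gated LM features.
    fused = s_t + [g * h for g, h in zip(gate, h_lm)]
    # 4. Map the fused state to a distribution over output tokens.
    hidden = relu(affine(params["fuse"], fused))
    return softmax(affine(params["out"], hidden)), gate


# Hypothetical sizes: vocabulary, decoder state, LM hidden space, fusion layer.
rng = random.Random(0)
vocab, d_state, d_lm, d_hid = 5, 4, 3, 6
params = {
    "proj": init_layer(rng, d_lm, vocab),
    "gate": init_layer(rng, d_lm, d_state + d_lm),
    "fuse": init_layer(rng, d_hid, d_state + d_lm),
    "out": init_layer(rng, vocab, d_hid),
}
s_t = [rng.uniform(-1, 1) for _ in range(d_state)]
lm_logits = [rng.uniform(-1, 1) for _ in range(vocab)]
probs, gate = cold_fusion_step(s_t, lm_logits, params)
```

In a real system only the Seq2Seq and fusion parameters would be trained while the LM stays frozen; projecting logits rather than the LM's hidden state is what would let a different LM with the same vocabulary be swapped in, although this sketch does not demonstrate that.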
Abstract
Sequence-to-sequence (Seq2Seq) models with attention have excelled at tasks that involve generating natural language sentences, such as machine translation, image captioning, and speech recognition. Performance has been further improved by leveraging unlabeled data, often in the form of a language model. In this work, we present the Cold Fusion method, which leverages a pre-trained language model during training, and show its effectiveness on the speech recognition task. We show that Seq2Seq models with Cold Fusion make better use of language information, enjoying (i) faster convergence and better generalization, and (ii) almost complete transfer to a new domain while using less than 10% of the labeled training data.
