Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs
Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Phil Woodland, Ricard Marxer
TL;DR
This paper addresses a bottleneck in cross-modal transfer for Text-Speech LMs, arguing that early fusion/fission architectures fail to preserve the compositional structure needed to align speech with text. It introduces SmolTolk, a late-fusion model with speech input adapters and multi-level fission guided by a layer selector: speech is generated from features drawn from both high and low levels of the model, while text is produced from the backbone. Empirically, SmolTolk rivals or surpasses state-of-the-art TSLMs trained with orders of magnitude more compute, and it significantly improves cross-modal performance over early fusion/fission baselines. Representation analyses suggest that the model abstracts more semantic features from speech and that speech and text representation spaces become increasingly shared across layers. The findings suggest that respecting the hierarchical composition of multimodal features is key to efficient cross-modal transfer, and that this principle may generalize to other modalities.
Abstract
Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed into and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality -- specifically, the finer-grained nature of speech representations compared to text -- preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
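
To make the contrast between early fission and multi-level fission concrete, the following is a minimal NumPy sketch, not the SmolTolk implementation. It assumes the layer selector can be approximated by a single softmax-weighted mix over per-layer hidden states; the names `early_fission`, `multi_level_fission`, `selector_logits`, and `w_out` are hypothetical and chosen only for illustration, and the actual selector in the paper may differ (for example, by being input-dependent).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def early_fission(hidden_states, w_out):
    """Baseline: speech logits come from the last layer via one linear head."""
    return hidden_states[-1] @ w_out


def multi_level_fission(hidden_states, selector_logits, w_out):
    """Assumed sketch: mix all layers with softmax weights, then apply the head.

    hidden_states:   list of (seq_len, d_model) arrays, one per backbone layer
    selector_logits: (num_layers,) scores; learned in a real model (assumption)
    w_out:           (d_model, speech_vocab) output projection
    """
    weights = softmax(selector_logits)               # (num_layers,)
    stacked = np.stack(hidden_states, axis=0)        # (num_layers, seq_len, d_model)
    mixed = np.tensordot(weights, stacked, axes=1)   # (seq_len, d_model)
    return mixed @ w_out, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [rng.normal(size=(4, 8)) for _ in range(3)]  # 3 toy layers
    head = rng.normal(size=(8, 5))                         # toy speech vocab of 5
    logits, weights = multi_level_fission(layers, np.zeros(3), head)
    print(logits.shape, weights)
```

In this sketch, pushing all selector weight onto the final layer recovers the early-fission baseline, so the baseline is a special case of the mix; spreading weight across layers lets the speech head read lower-level features as well.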
