Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs

Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Phil Woodland, Ricard Marxer

TL;DR

This paper addresses limited cross-modal transfer in Text-Speech LMs, arguing that early fusion/fission architectures neglect the compositional structure needed to align speech with text. It introduces SmolTolk, a late-fusion model with speech-specific input layers and multi-level fission guided by a layer selector: speech tokens are predicted from pooled multi-layer representations passed through speech-specific output layers, while text tokens are predicted from the final backbone layer. Empirically, SmolTolk rivals or surpasses state-of-the-art TSLMs trained with orders of magnitude more compute, and it significantly improves cross-modal performance over early fusion/fission baselines. Representation analyses suggest enhanced semantic abstraction from speech and increasingly shared representation spaces across layers. The findings suggest that respecting the hierarchical composition of multimodal features is key to efficient cross-modal transfer and may generalize to other modalities.

Abstract

Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality -- specifically, the finer-grained nature of speech representations compared to text -- preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.

Paper Structure

This paper contains 29 sections, 7 equations, 11 figures, and 10 tables.

Figures (11)

  • Figure 1: Proposed architecture. The model processes interleaved text-speech sequences. The [swt] token denotes a modality switch. Late fusion: Speech inputs (blue) are processed by speech-specific layers before merging with text embeddings (green) in the text LM backbone. Multi-level fission: An input speech residual and an average of the layers' representations, weighted by input-dependent weights, together produce multi-layer representations. Late fission: These are passed through output speech-specific layers to predict speech tokens. Text tokens are predicted from the final backbone layer.
  • Figure 2: Scaling of the LibriSpeech dev set negative log-likelihood (NLL) and tStoryCloze accuracy across modalities with respect to training compute (in FLOPs).
  • Figure 2: Architecture ablation study. "--" denotes removal. "--Dyn. pooling" uses fixed learned weights instead of dynamic ones from the layer selector, while "--Layer pooling" entirely disables multi-layer pooling, relying only on the last text LM layer.
  • Figure 3: Ablations. Top: Intrinsic dimensionality of representations. Bottom: Fraction of variance explained by cross-modal projections.
  • Figure 4: Selector $S$ layer weights across a speech input sequence for SmolTolk-2B. Vertical bars indicate word endings.
  • ...and 6 more figures
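
The multi-level fission step described in the Figure 1 caption can be illustrated with a minimal sketch for a single sequence position. This is not the authors' implementation. The function names, the linear form of the selector, the use of the last layer's state as the selector input, and the additive combination with the speech residual are all assumptions made for illustration; the source states only that the layer weights are input-dependent and that a speech residual is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_level_fission(layer_states, speech_residual, selector_matrix):
    """Pool per-layer hidden states for one position (illustrative only).

    layer_states:    list of L vectors (one per backbone layer), each of size d
    speech_residual: vector of size d from the speech input side
    selector_matrix: L x d matrix, a hypothetical linear layer selector
    """
    # Assumption: the selector reads the last backbone layer's state.
    query = layer_states[-1]
    logits = [sum(w * q for w, q in zip(row, query)) for row in selector_matrix]
    weights = softmax(logits)  # one input-dependent weight per layer

    dim = len(query)
    pooled = [
        sum(weights[l] * layer_states[l][i] for l in range(len(layer_states)))
        for i in range(dim)
    ]
    # Assumption: the speech residual is combined by addition.
    fused = [p + r for p, r in zip(pooled, speech_residual)]
    return fused, weights


if __name__ == "__main__":
    states = [[1.0, 0.0], [3.0, 2.0]]   # two layers, hidden size 2
    residual = [0.5, 0.5]
    selector = [[0.0, 0.0], [0.0, 0.0]]  # zero selector gives uniform weights
    print(multi_level_fission(states, residual, selector))
```

Replacing the selector output with a fixed learned weight vector would correspond to the "--Dyn. pooling" ablation, and using only `layer_states[-1]` would correspond to "--Layer pooling".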