A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

Nicolas Calbucura, Valentin Barriere

TL;DR

The paper introduces a simple, low-cost method to augment pre-trained language models with speech information for classification by selecting task-relevant audio tokens via L1-regularized logistic regression on Bag-of-Words features, then pre-training embeddings for these tokens within a frozen LLM and finally fine-tuning with LoRA on downstream tasks. By using a SpeechTokenizer based on RVQ, the approach keeps the audio vocabulary compact (<1% of the original) and demonstrates state-of-the-art performance on argumentative fallacy classification and detection. Key insights include the utility of acoustic-only tokens, the importance of pre-training for audio embeddings, and the robustness of the approach even with random token selections. The work offers a practical, interpretable pathway to fuse speech cues with textual models, with potential extensions to generation tasks and deeper analysis of token-attention dynamics.

Abstract

This paper presents a simple method for enhancing textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classical issue when fusing many audio embeddings with text is that the audio sequence is much longer than the text sequence. Our method builds on an existing speech tokenizer trained for Automatic Speech Recognition, which outputs long sequences of tokens from a large vocabulary, making it difficult to integrate into a large language model at low cost. By applying a simple lasso-based feature selection to a multimodal Bag-of-Words representation, we retain only the audio tokens most important for the task and adapt the language model to them with a self-supervised language modeling objective before fine-tuning it on the downstream task. We show that this improves performance compared to a unimodal model, to a larger SpeechLM, and to integrating audio via a learned representation. We demonstrate the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks, where the use of audio was believed to be counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
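
As a rough illustration of the lasso-based selection step described in the abstract, the sketch below fits an $\ell_1$-regularized logistic regression on Bag-of-Words counts and keeps the audio tokens whose weights are non-zero. It is not the authors' implementation: the token names (`<a12>`, `<a7>`, `<a99>`), the helper functions, the toy data, and the hyperparameters `lam`, `lr`, and `steps` are all hypothetical, and the solver is a plain proximal gradient descent written with the standard library only, chosen to keep the example self-contained.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Turn token lists into count vectors over a sorted shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            row[index[tok]] = float(count)
        rows.append(row)
    return vocab, rows


def soft_threshold(value, threshold):
    """Proximal operator of the l1 penalty: shrink towards zero, clip to zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(rows, labels, lam=0.05, lr=0.5, steps=500):
    """Proximal gradient descent on mean log-loss + lam * ||w||_1.

    The bias is left unpenalized. lam, lr and steps are illustrative values.
    """
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        probs = [
            1.0 / (1.0 + math.exp(-(bias + sum(w * x for w, x in zip(weights, row)))))
            for row in rows
        ]
        errors = [p - y for p, y in zip(probs, labels)]
        bias -= lr * sum(errors) / n
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errors, rows)) / n
            weights[j] = soft_threshold(weights[j] - lr * grad, lr * lam)
    return weights, bias


def select_audio_tokens(docs, labels, is_audio=lambda tok: tok.startswith("<a"), **fit_kwargs):
    """Return the audio tokens that receive a non-zero weight."""
    vocab, rows = bag_of_words(docs)
    weights, _ = fit_l1_logistic(rows, labels, **fit_kwargs)
    return {tok for tok, w in zip(vocab, weights) if is_audio(tok) and w != 0.0}


# Toy multimodal documents: text tokens mixed with hypothetical audio tokens.
# <a12> only occurs in class 1, <a7> only in class 0, <a99> occurs in both.
docs = [
    ["the", "<a12>", "<a99>"],
    ["the", "<a12>"],
    ["the", "<a7>", "<a99>"],
    ["the", "<a7>"],
]
labels = [1, 1, 0, 0]

selected = select_audio_tokens(docs, labels)
print(sorted(selected))  # ['<a12>', '<a7>']
```

On this toy data the uninformative token `<a99>` keeps a zero weight and is dropped, while the two class-specific tokens are retained. In a realistic setting an off-the-shelf solver and a tuned regularization strength would likely replace the hand-written loop.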

Paper Structure

This paper contains 20 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Step 1 of our method is an audio token selection pipeline based on an $\ell_1$-regularized logistic regression over a Bag-of-Words representation. This results in fewer audio tokens being selected for a specific task.
  • Figure 2: Step 2 and Step 3 consist of pre-training the audio token embeddings, followed by fine-tuning the multimodal LLM on the downstream task.
  • Figure 3: Task prompt given to the Qwen2-Audio model for in-context learning.