Vision-Speech Models: Teaching Speech Models to Converse about Images

Amélie Royer, Moritz Böhle, Gabriel de Marmiesse, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez, Patrick Pérez

TL;DR

This work builds a practical Vision-Speech Model by turning a real-time conversational speech LLM (Moshi) into MoshiVis through lightweight image-adaptation modules and a gated cross-attention mechanism. It uses a simple, one-stage fine-tuning pipeline that mixes image-speech data with "speechless" image-text data, leveraging abundant Vision-Language resources while keeping audio supervision minimal. A synthetic dialogue generation pipeline supplies the data used to train conversational behavior. The model demonstrates competitive vision-language understanding with both text and audio prompts, exhibits improved context-switch robustness, and maintains real-time latency on consumer-grade hardware. By releasing inference code and audio benchmarks, the work provides a reproducible, scalable path toward multimodal dialogue systems that can discuss images and general topics in natural, real-time conversations.

Abstract

The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) real-time latency at inference is crucial, which imposes compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., "speechless") and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation.

Paper Structure

This paper contains 24 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: MoshiVis is a Vision-Speech model (VSM) able to hold full-duplex real-time conversations about an image, and trained with a light data and compute budget. For image representations, we use off-the-shelf transformer-based image encoders from the PaliGemma family [beyer2024paligemma]. For the speech modelling part, we rely on Moshi [kyutai2024moshi], a recent speech LLM which jointly outputs text and audio tokens in real-time, allowing for full-duplex conversations. At its core, Moshi consists of a standard 7B decoder-only transformer taking as inputs speech tokens (which are the sums of temporally aligned text tokens and audio tokens extracted from the assistant's and user's streams), rather than only text like a standard LLM. The output of the transformer is then decoded into a text token, and separately passed through a small depth transformer which auto-regressively produces a hierarchy of audio codebooks, which are then decoded into audio frames. First, we detail how we augment the speech LLM's transformer with lightweight visual adaptation modules through cross-attention (CA). We then describe our one-stage finetuning pipeline for these modules: we use a mixture of (i) image+text only data ("speechless" data), which, despite incurring a distribution shift due to the lack of audio supervision, allows us to leverage the large body of existing Vision-Language datasets, and (ii) synthetic spoken visual dialogues which we design to mimic realistic discussions about images.
  • Figure 2: Adaptation modules. The image tokens are injected into the current speech token via residual cross-attention (CA) layers, placed between the multi-head self attention (MHSA) and the feedforward network (FFN) in every transformer block. As the cross-attention's QKV projections are shared across layers (marked with a chain icon in the figure), at inference we only need to compute the keys and values once per image, which reduces the memory needed to store the image embeddings. To make context switching easier, we modulate the output of the cross-attention with a binary gate. The resulting output is fed back into the speech token stream as a residual.
  • Figure 3: MoshiVis forward pass during mixed data training. Speech samples are composed of the user's and assistant's audio streams and a text stream (only for the assistant) containing extra padding tokens (_) to maintain the temporal alignment with speech. The input streams are summed and passed to the transformer. The output audio streams are auto-regressively decoded by a small transformer (Audio Depth Transformer). In practice, we only train the first two audio streams for speech samples. This allows for faster training as we need fewer parallel calls to the depth transformer. In contrast, speechless samples only contain standard text; in this case, MoshiVis acts as a standard transformer augmented with additional adaptation modules.
  • Figure 4: Training MoshiVis with different amounts of audio data on a) OCR-VQA, b) VQAv2, and c) COCO. In particular, we show the scores obtained by the model when prompting it either with text or audio and using greedy decoding. Note that even when training with no audio data at all, the cross-attention mechanism enables the speech model to obtain results substantially above chance on all benchmarks. While this can come at the cost of audio quality, we find that adding as little as 1% of audio data effectively recovers the base model's audio quality (see the MOSNet results table). For reference, we also report the results of the fine-tuned PaliGemma (stage 3 of [beyer2024paligemma]), as we use the same image encoder as a starting point for fine-tuning; note that in contrast to PaliGemma, we keep the image encoder and LLM frozen.
  • Figure 5: Context Switch Ablation. To assess the impact of data augmentation (left vs. right) as well as the gating configuration (different line styles), we prefix every MMLU question with a randomly sampled conversation about an image (V$\rightarrow$NV), and every COCO captioning question with a randomly sampled general discussion (NV$\rightarrow$V). We report the model's relative performance as a function of the random prefix length (expressed in number of question-answer turns). We find that both data augmentation and gating improve the model's robustness to context switching.
  • ...and 2 more figures
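
The adaptation module described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `SharedGatedCrossAttention`, the single attention head, the random weights, and the gate parameterisation (a sigmoid over a linear map of the cross-attention output, i.e. a soft gate rather than a hard binary one) are all assumptions made for illustration. What it aims to show is the caching pattern the caption describes: because the projections are shared across layers, the image keys and values are computed once per image and reused by every block.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class SharedGatedCrossAttention:
    """Single-head gated cross-attention (hypothetical, illustrative only).

    One instance is meant to be reused by every transformer block, which is
    how the sharing of projections across layers is modelled here.
    """

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Assumed gate parameters: one scalar gate per speech token.
        self.w_gate = rng.normal(0.0, scale, (dim, 1))

    def cache_image(self, image_tokens):
        # Keys and values depend only on the image, so compute them once.
        return image_tokens @ self.w_k, image_tokens @ self.w_v

    def __call__(self, x, kv_cache):
        keys, values = kv_cache
        q = x @ self.w_q
        attn = softmax(q @ keys.T / np.sqrt(self.dim))
        ca_out = attn @ values
        # Soft gate in (0, 1); values near 0 let the model ignore the image.
        gate = 1.0 / (1.0 + np.exp(-(ca_out @ self.w_gate)))
        # Residual connection back into the speech token stream.
        return x + gate * ca_out


rng = np.random.default_rng(0)
layer = SharedGatedCrossAttention(dim=8, rng=rng)
image_tokens = rng.normal(size=(5, 8))   # 5 image tokens
speech_tokens = rng.normal(size=(3, 8))  # 3 speech tokens

kv_cache = layer.cache_image(image_tokens)  # once per image
hidden = speech_tokens
for _ in range(4):  # stand-in for 4 transformer blocks sharing the module
    hidden = layer(hidden, kv_cache)
```

In a full model the self-attention and feedforward sublayers would sit on either side of this call in each block; they are omitted here to keep the focus on the shared cache and the gated residual.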