Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Alan Dao, Dinh Bach Vu, Huy Hoang Ha

TL;DR

Ichigo addresses the latency and integration challenges of speech-enabled natural language processing by introducing a tokenized mixed-modal approach that fuses speech and text into a shared token space. Leveraging WhisperVQ for speech tokenization and a uniform decoder-only transformer, Ichigo extends Llama-3 with modality-specific tokens to enable end-to-end reasoning and generation over interleaved speech-text sequences. The work demonstrates state-of-the-art performance on speech-question-answering benchmarks, real-time latency of about 111 ms to first token, and robustness across multi-turn conversations, while releasing a large cross-modal instruction dataset and open-source training/inference code. This approach lowers barriers for smaller research teams to contribute to open-source speech-language modeling and points to practical, low-latency voice assistants that operate effectively on edge hardware.

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
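The tokenized early-fusion idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary size, codebook size, the two boundary tokens, and the names SharedVocab and build_sequence are all hypothetical, chosen only to show how quantized speech codes and text token IDs could share one ID space so that a single decoder-only transformer reads them as one sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SharedVocab:
    """Hypothetical shared token space: text IDs first, then speech IDs."""
    text_vocab_size: int  # assumed size of the original text vocabulary
    codebook_size: int    # assumed number of codes in the speech quantizer

    @property
    def sound_start(self) -> int:
        # Assumed boundary token that opens a speech segment.
        return self.text_vocab_size

    @property
    def sound_end(self) -> int:
        # Assumed boundary token that closes a speech segment.
        return self.text_vocab_size + 1

    @property
    def total_size(self) -> int:
        # Text tokens + 2 boundary tokens + one token per speech code.
        return self.text_vocab_size + 2 + self.codebook_size

    def speech_token(self, code: int) -> int:
        """Map a quantizer code index to its ID in the shared vocabulary."""
        if not 0 <= code < self.codebook_size:
            raise ValueError(f"speech code {code} is outside the codebook")
        return self.text_vocab_size + 2 + code


def build_sequence(vocab: SharedVocab,
                   segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Flatten interleaved ("text", ids) / ("speech", codes) segments."""
    ids: List[int] = []
    for kind, values in segments:
        if kind == "text":
            # Text token IDs are already in the shared space.
            ids.extend(values)
        elif kind == "speech":
            # Wrap the shifted speech codes in the boundary tokens.
            ids.append(vocab.sound_start)
            ids.extend(vocab.speech_token(c) for c in values)
            ids.append(vocab.sound_end)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ids


# Toy numbers only; real vocabularies and codebooks are much larger.
vocab = SharedVocab(text_vocab_size=1000, codebook_size=512)
sequence = build_sequence(
    vocab,
    [("text", [5, 6]), ("speech", [0, 511]), ("text", [7])],
)
print(sequence)
```

Because every ID in the flattened sequence indexes the same embedding table, the model needs no separate speech adapter in this sketch; the only change relative to a text-only model is a larger vocabulary.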

Paper Structure

This paper contains 33 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Ichigo represents speech and text modalities as discrete tokens and uses a uniform transformer-based architecture. It uses WhisperVQ to quantize speech into discrete tokens, in the same manner as the original text modality.
  • Figure 2: Data Processing Pipeline for Speech Instruction Dataset Generation. This chart illustrates the multi-stage filtering and conversion process, starting from 6M samples drawn from multiple open-source text instruction datasets. Filtering reduces the data to 2.2M samples. Finally, these samples are converted to speech instruction data using WhisperSpeech (TTS) and WhisperVQ (speech to semantic tokens), yielding 1.3M pairs of speech instructions and text answers.
  • Figure 3: a. Distribution of data types in the instruction fine-tuning dataset. This distribution was chosen to enhance speech comprehension while maintaining robust general language abilities. b. Distribution of data samples used in the enhancement fine-tuning stage. This distribution improves Ichigo's robustness in handling multi-turn conversations and inaudible inputs.
  • Figure 4: The system prompt used for Ichigo during inference.
  • Figure 5: The model follows text-based system prompts during speech-based conversations with users.
  • ...and 1 more figures