Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Alan Dao, Dinh Bach Vu, Huy Hoang Ha
TL;DR
Ichigo addresses the latency and integration challenges of speech-enabled natural language processing with a tokenized mixed-modal approach that maps speech and text into a shared token space. It uses WhisperVQ for speech tokenization and a uniform decoder-only transformer, extending Llama-3 with modality-specific tokens to enable end-to-end reasoning and generation over interleaved speech-text sequences. The work reports state-of-the-art performance among open-source speech language models on speech question-answering benchmarks, a latency of about 111 ms to first token, and robustness across multi-turn conversations, and it releases a large cross-modal instruction dataset together with open-source training and inference code. The approach lowers barriers for smaller research teams to contribute to open-source speech-language modeling and points toward practical, low-latency voice assistants that operate effectively on edge hardware.
Abstract
Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Using a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving results comparable to those of cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than that of current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
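To make the tokenized early-fusion idea concrete, the following is a minimal sketch, not the authors' implementation, of how quantized speech codes could share one token-id space with text tokens. The class `MixedModalVocab`, the function `interleave`, the start and end marker tokens, and the vocabulary and codebook sizes are all hypothetical and chosen only for illustration. Speech codes are assumed to arrive as integer codebook indices from some quantizer.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MixedModalVocab:
    """Hypothetical shared vocabulary: text ids first, then speech ids."""

    text_vocab_size: int  # assumed size of the base text vocabulary
    codebook_size: int    # assumed number of entries in the speech codebook

    @property
    def sound_start_id(self) -> int:
        # Hypothetical marker token opening a speech segment.
        return self.text_vocab_size

    @property
    def sound_end_id(self) -> int:
        # Hypothetical marker token closing a speech segment.
        return self.text_vocab_size + 1

    @property
    def total_size(self) -> int:
        # Text tokens + two markers + one token per codebook entry.
        return self.text_vocab_size + 2 + self.codebook_size

    def speech_code_to_id(self, code: int) -> int:
        """Map a codebook index to its id in the shared token space."""
        if not 0 <= code < self.codebook_size:
            raise ValueError(f"speech code {code} outside codebook")
        return self.text_vocab_size + 2 + code

    def encode_speech(self, codes: Sequence[int]) -> List[int]:
        """Wrap a run of speech codes in start/end markers."""
        body = [self.speech_code_to_id(c) for c in codes]
        return [self.sound_start_id, *body, self.sound_end_id]


Segment = Tuple[str, Sequence[int]]


def interleave(vocab: MixedModalVocab, segments: Sequence[Segment]) -> List[int]:
    """Flatten ("text", ids) and ("speech", codes) segments into one sequence.

    The result is a single list of token ids that one decoder-only
    transformer could consume, with no modality-specific adapter.
    """
    sequence: List[int] = []
    for kind, values in segments:
        if kind == "text":
            for token_id in values:
                if not 0 <= token_id < vocab.text_vocab_size:
                    raise ValueError(f"text id {token_id} outside text vocabulary")
            sequence.extend(values)
        elif kind == "speech":
            sequence.extend(vocab.encode_speech(values))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence
```

In this sketch the model's embedding table would need `total_size` rows, so the only architectural change relative to a text-only model is a larger vocabulary. That is the sense in which early fusion avoids separate adapters.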
