MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

Jingyao Gong

Abstract

MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio training, making the complete interaction loop directly inspectable. The model uses a full MiniMind backbone as the Thinker and an independent four-layer Talker made from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders provide speech and image features, which are mapped by lightweight MLP projectors and injected at modality-placeholder positions. The Talker reads a middle-layer Thinker state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled by a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, so voice conditioning remains part of the audio-code context rather than a separate TTS module. With a 768-dimensional Talker, the dense and MoE variants reach average character error rates (CERs) of 0.0897 and 0.0900 in Thinker-Talker consistency evaluation, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface.
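The abstract names two mechanisms that carry most of the architecture: projected encoder features are written into modality-placeholder positions of the Thinker's input, and a middle-layer Thinker hidden state serves as the semantic bridge handed to the Talker. The following is a minimal sketch of those two mechanisms only, assuming illustrative sizes and names (`MLPProjector`, `AUDIO_PLACEHOLDER_ID`, the bridge layer index, the encoder feature width); it is not the released implementation.

```python
# Minimal sketch (not the release code) of placeholder injection and
# middle-layer bridging. All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

HIDDEN = 512               # assumed Thinker hidden size
AUDIO_PLACEHOLDER_ID = 3   # assumed placeholder token id

class MLPProjector(nn.Module):
    """Lightweight MLP mapping frozen encoder features into the LM hidden space."""
    def __init__(self, enc_dim: int, hid: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(enc_dim, hid), nn.GELU(), nn.Linear(hid, hid))

    def forward(self, x):
        return self.net(x)

def inject_features(tok_emb, input_ids, feats, placeholder_id):
    """Overwrite embeddings at placeholder positions with projected features.

    tok_emb: (B, T, H) token embeddings; feats: (N, H) projected features,
    one per placeholder position across the batch, in row-major order.
    """
    out = tok_emb.clone()
    mask = input_ids == placeholder_id          # (B, T) boolean
    assert int(mask.sum()) == feats.size(0)
    out[mask] = feats
    return out

# Toy Thinker: a stack of Transformer layers; the hidden state of a middle
# layer is kept as the semantic bridge passed to the Talker.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(HIDDEN, nhead=8, batch_first=True) for _ in range(8)
)
BRIDGE_LAYER = 4  # assumed middle-layer index

embed = nn.Embedding(100, HIDDEN)
audio_proj = MLPProjector(enc_dim=560, hid=HIDDEN)  # 560 = assumed encoder feature width

input_ids = torch.tensor([[5, AUDIO_PLACEHOLDER_ID, AUDIO_PLACEHOLDER_ID, 9]])
audio_feats = audio_proj(torch.randn(2, 560))       # stand-in for frozen encoder output

h = inject_features(embed(input_ids), input_ids, audio_feats, AUDIO_PLACEHOLDER_ID)
bridge = None
for i, layer in enumerate(layers):
    h = layer(h)
    if i == BRIDGE_LAYER:
        bridge = h            # middle-layer state handed to the Talker
print(bridge.shape)           # torch.Size([1, 4, 512])
```

Because the encoders stay frozen, only the projector (and, depending on the training mode, the backbone) receives gradients; the placeholder convention lets text and non-text content share one autoregressive sequence.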

Paper Structure

This paper contains 14 sections, 1 equation, 11 figures, and 10 tables.

Figures (11)

  • Figure 1: Architecture of MiniMind-O. Audio and image inputs are encoded by frozen SenseVoice and SigLIP2 encoders, mapped into the MiniMind hidden space by MLP projectors, and injected at modality-placeholder positions. A middle-layer Thinker state is fused with the Mimi-code history by an independent Talker, which predicts eight codec layers for streaming speech generation.
  • Figure 2: Talker-side speech generation design. The Talker consumes the Thinker bridge state, audio-code embeddings, optional speaker information, and reference codec prompts, then emits eight-layer Mimi codebook logits for waveform decoding (a minimal sketch of this interface follows the figure list).
  • Figure 3: Training sequence format for Thinker and Talker. Text supervision is applied to the Thinker response tokens, while audio supervision is applied to target Mimi code positions. Reference-code regions are used as conditioning context rather than loss targets.
  • Figure 4: Training pipeline used by the current implementation. The active training script, train_sft_omni.py, runs on T2A, I2T, and A2A data, with an all mode for full-model updates and a vision_proj pass for projector-only visual alignment. SenseVoice and SigLIP2 remain frozen during training.
  • Figure 5: Input token layout in MiniMind-O. Text tokens, audio placeholders, image placeholders, speaker tokens, reference codes, and target audio codes occupy aligned positions so that the Thinker and Talker can be trained under a single autoregressive schedule.
  • ...and 6 more figures
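The Talker-side interface described in Figures 2, 3, and 5 can be summarized compactly: the Thinker bridge state is fused with an embedded history of eight-layer Mimi codes, one logit head per codebook predicts the next code frame, and loss is applied only at target code positions while reference-code (voice prompt) regions serve as context. The sketch below assumes a simple additive fusion, an illustrative codebook size, and hypothetical names (`TalkerSketch`, `CODE_VOCAB`, the prompt length); the release code may differ in all of these details.

```python
# Minimal sketch (assumptions noted above) of the Talker's eight-codebook
# interface and the Figure 3-style masked loss.
import torch
import torch.nn as nn

N_CODEBOOKS = 8    # eight Mimi codec layers, per the paper
CODE_VOCAB = 2048  # assumed Mimi codebook size
HIDDEN = 768       # Talker width reported in the abstract

class TalkerSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.code_emb = nn.ModuleList(
            nn.Embedding(CODE_VOCAB, HIDDEN) for _ in range(N_CODEBOOKS)
        )
        self.blocks = nn.ModuleList(  # four-layer Talker, per the abstract
            nn.TransformerEncoderLayer(HIDDEN, nhead=8, batch_first=True) for _ in range(4)
        )
        self.heads = nn.ModuleList(
            nn.Linear(HIDDEN, CODE_VOCAB) for _ in range(N_CODEBOOKS)
        )

    def forward(self, bridge, codes):
        # bridge: (B, T, HIDDEN) middle-layer Thinker states
        # codes:  (B, T, 8) Mimi code history; layer embeddings summed, then
        # fused with the bridge by simple addition (an assumption here)
        h = bridge + sum(emb(codes[..., k]) for k, emb in enumerate(self.code_emb))
        for blk in self.blocks:
            h = blk(h)
        return torch.stack([head(h) for head in self.heads], dim=2)  # (B, T, 8, V)

talker = TalkerSketch()
B, T = 1, 10
logits = talker(
    torch.randn(B, T, HIDDEN),
    torch.randint(0, CODE_VOCAB, (B, T, N_CODEBOOKS)),
)

# Masked loss in the style of Figure 3: supervise only target audio-code
# positions; the reference prompt occupies earlier positions as pure context.
target = torch.randint(0, CODE_VOCAB, (B, T, N_CODEBOOKS))
is_target = torch.zeros(B, T, dtype=torch.bool)
is_target[:, 4:] = True  # assumed prompt length of 4 frames
loss = nn.functional.cross_entropy(
    logits[is_target].reshape(-1, CODE_VOCAB), target[is_target].reshape(-1)
)
print(logits.shape, float(loss))
```

Keeping the reference codes inside the autoregressive context, rather than routing them through a separate TTS conditioning path, is what lets voice cloning reuse the same next-code prediction objective as ordinary speech generation.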