Megrez-Omni Technical Report
Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Shengen Yan, Guohao Dai, Yu Wang
TL;DR
Megrez targets the deployment of capable multimodal language models on edge devices. It introduces two compact models: Megrez-3B-Instruct, a language model, and Megrez-3B-Omni, which extends it with dedicated visual and audio encoders. The approach combines a standard LLaMA-style language backbone with SigLip-400M visual encoding and Whisper-based audio encoding. Training uses a two-stage regime (pretraining and SFT), plus vision/audio alignment and omni instruction tuning, which yields strong performance across text, image, and audio tasks at edge scale. Key contributions include state-of-the-art results for a 3B on-device omni model on multiple vision benchmarks, competitive language and OCR results, and an open-source release to accelerate research on edge-side omni-modal AI. The work shows how careful data curation, modality alignment, and efficient training strategies make it practical to deploy robust, low-latency multimodal LLMs on resource-constrained devices.
Abstract
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities and demonstrates strong versatility and robustness, setting a new benchmark for multimodal AI models.
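To make the architecture summarized in the TL;DR more concrete, the following is a minimal sketch of one common way an omni model can feed encoder outputs into a language backbone: each encoder's features are linearly projected to the LLM hidden size and spliced into the text embedding sequence at placeholder positions. Everything here is an assumption made for illustration and is not taken from the report: the linear projectors, the `<image>` and `<audio>` placeholder tokens, the tiny dimensions, and the helper names (`make_projector`, `project`, `fuse`, `toy_text_embed`) are all hypothetical, and random weights stand in for trained ones.

```python
# Illustrative sketch only: the projector design, dimensions, and placeholder
# tokens below are assumptions, not details taken from the Megrez report.
import random

LLM_DIM = 8      # hypothetical LLM hidden size (tiny, for illustration)
VISION_DIM = 6   # hypothetical visual-encoder feature size
AUDIO_DIM = 4    # hypothetical audio-encoder feature size


def make_projector(in_dim, out_dim, seed):
    """Return a random (out_dim x in_dim) matrix standing in for a trained projector."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weight):
    """Apply the linear projector to every feature vector in a sequence."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in features]


def toy_text_embed(token):
    """Deterministic stand-in for the language model's embedding table."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(LLM_DIM)]


def fuse(tokens, text_embed, image_feats, audio_feats, vision_proj, audio_proj):
    """Build the LLM input sequence, expanding placeholders into projected features."""
    sequence = []
    for tok in tokens:
        if tok == "<image>":
            sequence.extend(project(image_feats, vision_proj))
        elif tok == "<audio>":
            sequence.extend(project(audio_feats, audio_proj))
        else:
            sequence.append(text_embed(tok))
    return sequence


vision_proj = make_projector(VISION_DIM, LLM_DIM, seed=0)
audio_proj = make_projector(AUDIO_DIM, LLM_DIM, seed=1)
image_feats = [[1.0] * VISION_DIM for _ in range(3)]  # 3 hypothetical image patches
audio_feats = [[1.0] * AUDIO_DIM for _ in range(2)]   # 2 hypothetical audio frames
tokens = ["Describe", "<image>", "and", "<audio>"]

inputs = fuse(tokens, toy_text_embed, image_feats, audio_feats, vision_proj, audio_proj)
print(len(inputs), len(inputs[0]))  # 7 positions, each of size LLM_DIM (8)
```

In this sketch the language backbone would see a single sequence of same-sized vectors regardless of modality, which is what allows one model to handle text, image, and audio inputs. The actual connector used in Megrez-3B-Omni may differ from this simplification.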
