Megrez-Omni Technical Report

Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Shengen Yan, Guohao Dai, Yu Wang

TL;DR

Megrez addresses the challenge of deploying capable multimodal language models on edge devices by introducing Megrez-3B-Instruct, a compact LLM, and Megrez-3B-Omni, which builds on it with dedicated visual and audio encoders. The approach combines a standard LLaMA-style language backbone with SigLIP-400M visual encoding, Whisper-based audio encoding, and a two-stage training regime (pretraining and SFT) plus vision/audio alignment and omni instruction tuning, enabling strong performance across text, image, and audio tasks at edge scale. Key contributions include state-of-the-art results for a 3B on-device omni model on multiple vision benchmarks, competitive language and OCR results, and an open-source release to accelerate edge-side omni-modal AI research. The work demonstrates the practical impact of careful data curation, modality alignment, and efficient training strategies for deploying robust, low-latency multimodal LLMs on resource-constrained devices.

Abstract

In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities and demonstrates strong versatility and robustness, setting a new benchmark for multimodal AI models.

Paper Structure

This paper contains 27 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (a) Despite having the fewest parameters among the compared models, Megrez-3B-Instruct demonstrates superior accuracy on the MMLU benchmark. This further extends the capability boundaries of small-scale models, offering new intelligent solutions for edge devices. (b) Megrez-3B-Omni achieves state-of-the-art performance on a broad range of vision tasks compared with other open-source models.
  • Figure 2: Megrez-O architecture. During the training stage, the white module is frozen and the blue module is trained.
  • Figure 3: Qualitative results of Megrez-3B-Omni in captioning, chart reading, and general chat.
  • Figure 4: Qualitative results of Megrez-3B-Omni in math, OCR, GUI information, conversation, and reasoning.
  • Figure 5: The performance of Megrez-3B-Omni on the OpenCompass test set.