Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan

TL;DR

Mobile-O generates a 512x512 image in about 3 seconds on an iPhone. The authors present it as the first practical framework for real-time unified multimodal understanding and generation on edge devices, and hope it will ease future research on unified multimodal intelligence that runs entirely on-device with no cloud dependency.

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
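
As a rough illustration of why depthwise-separable convolutions keep a projector such as the MCP lightweight, the sketch below compares the weight count of a standard 1D convolution with that of its depthwise-separable counterpart. The channel widths and kernel size are assumed values chosen for illustration only; they are not taken from Mobile-O's actual configuration, and the helper names are hypothetical.

```python
def standard_conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard 1D convolution (bias omitted)."""
    return c_in * c_out * k


def depthwise_separable_conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise conv (one k-tap filter per input channel)
    followed by a 1x1 pointwise conv that mixes channels (bias omitted)."""
    depthwise = c_in * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumed sizes for illustration; not Mobile-O's real dimensions.
c_in, c_out, k = 1024, 1024, 3
std = standard_conv1d_params(c_in, c_out, k)
sep = depthwise_separable_conv1d_params(c_in, c_out, k)
print(f"standard: {std}, depthwise-separable: {sep}")
```

With these assumed sizes the separable variant needs roughly a third of the weights, and the saving grows with the kernel size, since the kernel dimension only multiplies the small depthwise term.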

Paper Structure

This paper contains 22 sections, 7 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: Comparison of our approach with existing unified models. Left: Qualitative comparison illustrating Mobile-O's capabilities in text-to-image generation, visual understanding, and visual prompt understanding. Right: Quantitative comparison with Show-O, Janus, and JanusFlow, demonstrating that Mobile-O achieves a superior trade-off. Our Mobile-O outperforms Show-O by 5.0% on GenEval and runs significantly faster on iPhone.
  • Figure 2: Comparison of our approach with existing unified models. Left: Qualitative comparison illustrating Mobile-O's capabilities in text-to-image generation, visual understanding, and visual prompt understanding. Right: Quantitative comparison with Show-O, Janus, and JanusFlow, demonstrating that Mobile-O achieves a superior trade-off. Our Mobile-O outperforms Show-O by 5.0% on GenEval and runs significantly faster on iPhone.
  • Figure 3: Overview of Mobile-O. Left: The proposed framework consists of an efficient image encoder with a compact autoregressive language model for visual understanding. For image generation, a lightweight linear diffusion transformer (DiT) is employed alongside a simple yet effective VAE-based encoder–decoder. Right: Our novel Mobile Conditioning Projector (MCP) bridges the understanding and generation tasks by directly conditioning the diffusion model on weighted hidden states from the VLM without the need for intermediate query tokens. The projector leverages layer-wise feature fusion, depthwise separable convolutions, and efficient channel attention to produce high-fidelity conditioning signals with minimal cost, enabling seamless deployment on edge devices.
  • Figure 4: Overview of the proposed unified multimodal post-training pipeline. We jointly optimize multimodal understanding and generation through a multi-task objective using a quadruplet format (generation prompt, image, question, answer). Both I2T and T2I losses are computed simultaneously, enabling aligned cross-modal learning where each training sample supports both multimodal understanding and generation.
  • Figure 5: Qualitative comparison of text-to-image generation (left) and visual understanding (right) across unified multimodal models. Each column shows Janus, JanusFlow, Show-O, and Mobile-O (ours) for the same prompts/questions. For image generation, Mobile-O yields more consistent, detailed, and semantically faithful images with high fidelity and style diversity. For visual understanding, it delivers more accurate and contextually coherent responses. Additional results are presented in the supplementary material. Best viewed zoomed in.
  • ...and 7 more figures
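
The quadruplet format described in the Figure 4 caption can be sketched roughly as follows: each sample carries a generation prompt, an image, a question, and an answer, so it supports both a text-to-image (T2I) and an image-to-text (I2T) objective at once. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Quadruplet` class, the `training_views` and `joint_loss` helpers, the field names, and the loss weights are all hypothetical, and the two loss values are stand-in numbers rather than outputs of a real model.

```python
from dataclasses import dataclass


@dataclass
class Quadruplet:
    generation_prompt: str  # text used to generate the image (T2I input)
    image: str              # placeholder identifier; a real pipeline would hold pixels or latents
    question: str           # question about the image (I2T input)
    answer: str             # reference answer (I2T target)


def training_views(sample: Quadruplet):
    """Split one quadruplet into the two supervised tasks it supports."""
    t2i = {"input": sample.generation_prompt, "target": sample.image}
    i2t = {"input": (sample.image, sample.question), "target": sample.answer}
    return t2i, i2t


def joint_loss(t2i_loss: float, i2t_loss: float,
               t2i_weight: float = 1.0, i2t_weight: float = 1.0) -> float:
    """Weighted sum of the two task losses (the weights are an assumption)."""
    return t2i_weight * t2i_loss + i2t_weight * i2t_loss


sample = Quadruplet(
    generation_prompt="a red bicycle leaning against a brick wall",
    image="img_0001.png",
    question="What color is the bicycle?",
    answer="Red",
)
t2i, i2t = training_views(sample)
total = joint_loss(t2i_loss=0.5, i2t_loss=0.25)  # stand-in loss values
print(t2i["input"], "->", t2i["target"])
print(i2t["input"], "->", i2t["target"])
print("joint loss:", total)
```

Because both views come from the same sample, the image that the generator is trained to produce is the same image the understanding branch is asked about, which is the alignment the caption refers to.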