Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling

Young D. Kwon, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Ramos, Sourav Bhattacharya

TL;DR

This work tackles the challenge of high-resolution on-device image editing with diffusion models by introducing MobilePicasso, a 3-stage hybrid pipeline that edits at standard resolution, projects into latent space, and upscales to 4K. A hallucination-aware loss paired with artefact filtering reduces artefacts, while Adaptive Context-Preserving Tiling (ACPT) and model-system co-design dramatically cut latency and memory usage on mobile hardware. Empirical results show up to 55.8× speedups and 1.15 GB peak memory on a Galaxy S23, with 14–51% reductions in hallucinations and 18–48% improvements in image quality, validated by a 46-participant user study. These advances enable practical, private, on-device high-resolution editing and offer a framework applicable to broader mobile generative AI tasks.

Abstract

High-resolution (4K) image-to-image synthesis has become increasingly important for mobile applications. Existing diffusion models for image editing face significant challenges in memory and image quality when deployed on resource-constrained devices. In this paper, we present MobilePicasso, a novel system that enables efficient image editing at high resolutions while minimising computational cost and memory usage. MobilePicasso comprises three stages: (i) performing image editing at a standard resolution with a hallucination-aware loss, (ii) applying latent projection to avoid going to the pixel space, and (iii) upscaling the edited image latent to a higher resolution with adaptive context-preserving tiling. Our user study with 46 participants reveals that MobilePicasso not only improves image quality by 18–48% but also reduces hallucinations by 14–51% over existing methods. MobilePicasso demonstrates significantly lower latency, e.g., up to a 55.8$\times$ speed-up, with only a small increase in runtime memory, e.g., a mere 9% increase over prior work. Surprisingly, the on-device runtime of MobilePicasso is observed to be faster than a server-based high-resolution image editing model running on an A100 GPU.

Paper Structure

This paper contains 47 sections, 3 equations, 11 figures, 8 tables, 2 algorithms.

Figures (11)

  • Figure 1: (a) Typical examples of the effect of image resolution on I2I generation. As the resolution grows from $512\times512$ to $2048\times2048$, I2I image-editing models such as IP2P are often unable to produce realistic images that align well with the edit prompt. (b) Latency and (c) memory of running the U-Net on a single tile for various tile sizes on the Snapdragon 8 Gen 2 NPU.
  • Figure 2: The overview of MobilePicasso's 3-stage hybrid pipeline, which partitions the task of high-resolution image editing into three stages: (1) image editing at standard resolution ($512^2$), (2) learnable latent projection in latent space, and (3) upscaling to higher resolutions (4K). This modular approach allows MobilePicasso to solve each stage effectively and efficiently for deployment.
  • Figure 3: The on-device tiling strategy with different overlap ratios and our proposed adjacent padding with 0% overlap.
  • Figure 4: (a, b) Latency of the tiling-based approach on high-resolution images for different tile sizes and overlap ratios, all measured on the NPU of the Snapdragon 8 Gen 2 chipset.
  • Figure 5: Qualitative comparison among image editing models and MobilePicasso given images at standard resolutions (e.g., $512\times512$).
  • ...and 6 more figures
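
As a rough illustration of why the overlap ratio in Figures 3 and 4 matters, the sketch below counts the tiles needed to cover an image for a given tile size and overlap ratio. It is a minimal sketch under stated assumptions, not MobilePicasso's implementation: the function names `tiles_per_axis` and `tile_count`, the plain sliding-window stride rule, and the 2048/512 example sizes are all illustrative choices that do not come from the paper.

```python
import math


def tiles_per_axis(length: int, tile: int, overlap: float) -> int:
    """Number of sliding-window tiles needed to cover one axis.

    Assumption: consecutive tiles are shifted by stride = tile * (1 - overlap),
    and the last tile is allowed to hang over the edge (e.g. handled by padding).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if length <= tile:
        return 1
    stride = max(1, round(tile * (1.0 - overlap)))
    return math.ceil((length - tile) / stride) + 1


def tile_count(height: int, width: int, tile: int, overlap: float) -> int:
    """Total number of square tiles covering a height x width image."""
    return tiles_per_axis(height, tile, overlap) * tiles_per_axis(width, tile, overlap)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    for ratio in (0.0, 0.25, 0.5):
        print(f"overlap={ratio:.2f}: {tile_count(2048, 2048, 512, ratio)} tiles")
```

Under these assumptions, moving from 0% to 50% overlap on a 2048×2048 input with 512×512 tiles raises the tile count from 16 to 49, so any per-tile cost is paid roughly three times as often.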