MetaISP -- Exploiting Global Scene Structure for Accurate Multi-Device Color Rendition
Matheus Souza, Wolfgang Heidrich
TL;DR
MetaISP addresses the problem of device-specific color rendition by learning a single, lightweight model that translates RAW inputs from one mobile device into RGB outputs that emulate multiple target ISPs. It achieves this through a device-conditional architecture with metadata awareness, a global semantics branch via cross-covariance attention, and a compact device embedding that enables interpolation between styles. The approach is trained with monitor-based pretraining and real-world data, and is shown to outperform prior single-device ISP translation methods in PSNR, SSIM, and perceptual color accuracy while supporting zero-shot and interpolated translations. This work reduces the need for per-device training, improves cross-device color fidelity, and offers practical potential for unified, controllable color rendition across a range of mobile cameras. The techniques—device-conditioned reconstructions, metadata-driven white balance handling, and global semantic integration—have immediate relevance for consumer photography pipelines and cross-device image sharing.
Abstract
Image signal processors (ISPs) are historically grown legacy software systems for reconstructing color images from noisy raw sensor measurements. Each smartphone manufacturer has developed its ISPs with its own characteristic heuristics for improving the color rendition, for example, skin tones and other visually essential colors. The recent interest in replacing the historically grown ISP systems with deep-learned pipelines to match DSLR's image quality improves structural features in the image. However, these works ignore the superior color processing based on semantic scene analysis that distinguishes mobile phone ISPs from DSLRs. Here, we present MetaISP, a single model designed to learn how to translate between the color and local contrast characteristics of different devices. MetaISP takes the RAW image from device A as input and translates it to RGB images that inherit the appearance characteristics of devices A, B, and C. We achieve this result by employing a lightweight deep learning technique that conditions its output appearance based on the device of interest. In this approach, we leverage novel attention mechanisms inspired by cross-covariance to learn global scene semantics. Additionally, we use the metadata that typically accompanies RAW images and estimate scene illuminants when they are unavailable.
