MetaISP -- Exploiting Global Scene Structure for Accurate Multi-Device Color Rendition

Matheus Souza, Wolfgang Heidrich

TL;DR

MetaISP addresses device-specific color rendition by learning a single, lightweight model that translates RAW inputs from one mobile device into RGB outputs that emulate multiple target ISPs. It does so through a device-conditional architecture with metadata awareness, a global semantics branch based on cross-covariance attention, and a compact device embedding that enables interpolation between styles. The model is trained with monitor-based pretraining and real-world data, and is reported to outperform prior single-device ISP translation methods in PSNR, SSIM, and perceptual color accuracy while supporting zero-shot and interpolated translations. This reduces the need for per-device training and improves cross-device color fidelity, pointing toward unified, controllable color rendition across a range of mobile cameras. The techniques (device-conditioned reconstruction, metadata-driven white balance handling, and global semantic integration) are relevant to consumer photography pipelines and cross-device image sharing.
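
To make the idea of interpolating between device styles concrete, here is a minimal sketch, assuming a plain linear blend of two embedding vectors followed by channel-wise scaling of a feature vector. The embedding values, the device names, and the helpers `interpolate_embedding` and `scale_features` are hypothetical and chosen for illustration only; the summary does not specify how MetaISP blends or applies its embeddings.

```python
from typing import Dict, List

# Hypothetical 4-dimensional device embeddings. The values and names are
# placeholders, not learned parameters from the paper.
DEVICE_EMBEDDINGS: Dict[str, List[float]] = {
    "device_a": [1.0, 0.0, 0.0, 0.5],
    "device_b": [0.0, 1.0, 0.0, 0.5],
}


def interpolate_embedding(src: str, dst: str, alpha: float) -> List[float]:
    """Linearly blend two device embeddings (assumed scheme).

    alpha = 0 returns the src embedding, alpha = 1 returns the dst embedding.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a = DEVICE_EMBEDDINGS[src]
    b = DEVICE_EMBEDDINGS[dst]
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def scale_features(features: List[float], embedding: List[float]) -> List[float]:
    """Use the embedding as a per-channel scaling factor on a feature vector."""
    if len(features) != len(embedding):
        raise ValueError("features and embedding must have the same length")
    return [f * e for f, e in zip(features, embedding)]
```

For example, `interpolate_embedding("device_a", "device_b", 0.5)` returns `[0.5, 0.5, 0.0, 0.5]`, a vector halfway between the two placeholder styles, which could then be passed to `scale_features` in place of a single device's embedding.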

Abstract

Image signal processors (ISPs) are historically grown legacy software systems for reconstructing color images from noisy raw sensor measurements. Each smartphone manufacturer has developed its ISPs with its own characteristic heuristics for improving the color rendition, for example, skin tones and other visually essential colors. Recent work on replacing these historically grown ISP systems with deep-learned pipelines that match DSLR image quality improves structural features in the image. However, these works ignore the superior color processing based on semantic scene analysis that distinguishes mobile phone ISPs from DSLRs. Here, we present MetaISP, a single model designed to learn how to translate between the color and local contrast characteristics of different devices. MetaISP takes the RAW image from device A as input and translates it to RGB images that inherit the appearance characteristics of devices A, B, and C. We achieve this result by employing a lightweight deep learning technique that conditions its output appearance based on the device of interest. In this approach, we leverage novel attention mechanisms inspired by cross-covariance to learn global scene semantics. Additionally, we use the metadata that typically accompanies RAW images and estimate scene illuminants when they are unavailable.
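
For readers unfamiliar with cross-covariance attention, the sketch below shows the generic single-head form, in which attention is computed between feature channels rather than between tokens. This is an illustration of the general technique the abstract says the method is inspired by, not MetaISP's actual block; the function names, the list-of-lists matrix layout (N tokens by d channels), and the fixed `temperature` argument are assumptions made for the example.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are channels


def l2_normalize_columns(m: Matrix) -> Matrix:
    """Scale each column (channel) to unit L2 norm across all tokens."""
    n_cols = len(m[0])
    # Fall back to 1.0 for an all-zero column to avoid division by zero.
    norms = [math.sqrt(sum(row[c] ** 2 for row in m)) or 1.0 for c in range(n_cols)]
    return [[row[c] / norms[c] for c in range(n_cols)] for row in m]


def softmax_columns(m: Matrix) -> Matrix:
    """Apply a numerically stable softmax down each column."""
    n_rows, n_cols = len(m), len(m[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        col = [m[r][c] for r in range(n_rows)]
        peak = max(col)
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        for r in range(n_rows):
            out[r][c] = exps[r] / total
    return out


def cross_covariance_attention(q: Matrix, k: Matrix, v: Matrix,
                               temperature: float = 1.0) -> Matrix:
    """Return V @ softmax(K_hat^T @ Q_hat / temperature) for N x d inputs."""
    n_tokens, d = len(q), len(q[0])
    qn, kn = l2_normalize_columns(q), l2_normalize_columns(k)
    # d x d channel-to-channel scores; the size does not depend on n_tokens.
    scores = [[sum(kn[t][i] * qn[t][j] for t in range(n_tokens)) / temperature
               for j in range(d)] for i in range(d)]
    attn = softmax_columns(scores)
    # Each output channel is a convex combination of the input channels of V.
    return [[sum(row[i] * attn[i][j] for i in range(d)) for j in range(d)]
            for row in v]
```

Because the attention map is d by d, its cost grows linearly with the number of tokens, which is one reason this family of attention is attractive for extracting global features from a whole image.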

Paper Structure (22 sections, 4 equations, 17 figures, 6 tables)

Figures (17)

  • Figure 1: Two scenes were captured using three different smartphone models and processed with different ISPs. When all RAW images were processed using the same OpenISP, the resulting appearance from all models was very similar. However, when the native ISPs for each device were used, strong appearance differences became visible.
  • Figure 1: Illuminant analysis. The images on the left show examples where illuminant estimation helps achieve more accurate colors. The images on the right show failure cases where the estimation layer cannot provide accurate white balance information, resulting in color deviations.
  • Figure 2: a) is an overview of the MetaISP pipeline, b) exemplifies the residual attention block, and c) describes the attention mechanism inside the residual blocks. In this architecture, the RAW patch image $x$ and the device embedding $e_{d}$ go through the illumination branch, which outputs the scene white balance. Next, the metadata awareness block projects $WB_{d}$ to match the backbone network's first two levels and aggregates the ISO and exposure time. Later, a downsampled version of the full-resolution image goes through the transformer-based global semantics block to extract global features, which are matched inside the bottleneck. Finally, $e_{d}$ conditions the decoding part with different device embeddings, serving as a scaling factor in the bottleneck and as the query vector inside the decoder attention mechanism. The numbers of convolutional layers, residual attention blocks, and residual blocks match those depicted in the figure. The residual block follows diagram b) without the attention and the innermost skip connection.
  • Figure 2: Full-resolution inference visualization. Here, we highlight the details of the horse that are more apparent in the Pixel images. MetaISP can accurately reproduce the color perception using the raw input from the iPhone, showcasing its ability to generate visually pleasing and accurate results, even in cases where the color rendition of different devices varies. This demonstrates the effectiveness of our approach in preserving and reproducing device-specific color characteristics, as evident in the high-quality output visualizations.
  • Figure 3: Patch-wise comparison between the different methods and the ground truth. The images mainly represent challenging scenarios, such as indoor and night scenes. Our model can accurately reproduce the color appearance of each target device. The alignment strategy described in the supplementary material effectively avoids the blurry effects that may be observed in SwinIR, resulting in sharper and more visually pleasing results. LiteISP, on the other hand, does not suffer from blurry effects but struggles to reproduce the color perception.
  • ...and 12 more figures