MERIT: Multi-domain Efficient RAW Image Translation

Wenjun Huang, Shenghao Fu, Yian Jin, Yang Ni, Ziteng Cui, Hanning Chen, Yirui He, Yezi Liu, Sanggeon Yun, SungHeon Jeong, Ryozo Masukawa, William Youngwoo Chung, Mohsen Imani

Abstract

RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream computer vision tasks. Prior methods address this problem by training domain-specific RAW-to-RAW translators for each source-target pair, but such approaches do not scale to real-world scenarios involving multiple types of commercial cameras. In this work, we introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. We further enhance the generator with a conditional multi-scale large kernel attention module for improved context and sensor-aware feature modeling. To facilitate standardized evaluation, we introduce MDRAW, the first dataset tailored for multi-domain RAW image translation, comprising both paired and unpaired RAW captures from five diverse camera sensors across a wide range of scenes. Extensive experiments demonstrate that MERIT outperforms prior models in both quality (5.56 dB improvement) and scalability (80% reduction in training iterations).

Paper Structure

This paper contains 24 sections, 13 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of the challenges and paradigms in multi-domain RAW-to-RAW (RAW2RAW) translation. (a). RAW images captured under the same scene vary across sensors due to differing spectral sensitivity and noise characteristics. (b). Without RAW2RAW translation, domain-specific task models must be independently trained for each camera. (c). One-to-one translation enables shared task models, but requires a separate translation model for every source-target pair, leading to scalability issues. (d). Our proposed method, MERIT, introduces a unified multi-domain RAW2RAW translator that supports all domains with a single model. (e). MERIT achieves superior scalability in both parameter count and training iterations as the number of camera domains increases, outperforming prior methods.
  • Figure 2: MERIT Framework Overview. (a). Overall architecture of MERIT. (b). The style encoder $\mathcal{E}$ extracts domain-specific embeddings from target RAW exemplars using a transformer-based architecture. (c). The generator $\mathcal{G}$ synthesizes target-domain RAW images. (d). The discriminator $\mathcal{D}$ operates on image patches, predicts the realism of each patch, and uses majority voting over the patches to reach its final decision. (e). The proposed MS-LKA module captures multi-scale features and modulates them via style-aware channel attention.
  • Figure 3: Visualization of the sensor-aware noise modeling process. (a). Input RAW image. (b). Gradient magnitude map computed via the Sobel operator. (c). Red-shaded regions indicate low-texture patches selected for sensor noise profile estimation.
  • Figure 4: Qualitative results on the RAW-to-RAW mapping dataset. Each group of images (from left to right) shows: the input source RAW image, predictions from CUT (Park et al., 2020), Xie et al. (2024), CycleGAN (Zhu et al., 2017), UVCGAN (Torbunov et al., 2023), Rawformer (Perevozchikov et al., 2024), our method MERIT, and the target ground truth. Odd-numbered rows display the translated RAW outputs, while even-numbered rows visualize the corresponding absolute error maps with respect to the ground truth.
  • Figure 5: Sample images from our MDRAW benchmark. (a). RAW images captured under various lighting and scenes from five different camera sensors. (b). Example of aligned multi-domain RAW captures of the same scene, used for evaluation in cross-domain RAW2RAW translation.
  • ...and 2 more figures
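
The patch selection described in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`sobel_magnitude`, `low_texture_patches`, `fit_noise_profile`), the patch size, the kept fraction, and the affine variance model `var = a * mean + b` are all hypothetical choices introduced here. The source only states that a Sobel gradient magnitude map is used to pick low-texture patches for sensor noise profile estimation, and that the noise is signal-dependent.

```python
import numpy as np


def sobel_magnitude(img):
    """Gradient magnitude of a 2-D array via 3x3 Sobel kernels (edge-padded)."""
    p = np.pad(np.asarray(img, dtype=np.float64), 1, mode="edge")
    # Horizontal derivative: right column minus left column, weighted 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Vertical derivative: bottom row minus top row, weighted 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)


def low_texture_patches(img, patch=16, keep_frac=0.25):
    """Return (row, col) corners of the flattest non-overlapping patches.

    Patches are ranked by mean Sobel magnitude; the lowest `keep_frac`
    fraction is kept (assumed selection rule, at least one patch).
    """
    grad = sobel_magnitude(img)
    h, w = grad.shape
    corners, scores = [], []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            corners.append((r, c))
            scores.append(grad[r:r + patch, c:c + patch].mean())
    n_keep = max(1, int(len(corners) * keep_frac))
    order = np.argsort(scores)[:n_keep]
    return [corners[i] for i in order]


def fit_noise_profile(img, corners, patch=16):
    """Least-squares fit of the assumed model var = a * mean + b.

    Each selected patch contributes one (mean, variance) sample.
    """
    img = np.asarray(img, dtype=np.float64)
    means = np.array([img[r:r + patch, c:c + patch].mean() for r, c in corners])
    variances = np.array([img[r:r + patch, c:c + patch].var() for r, c in corners])
    a, b = np.polyfit(means, variances, 1)
    return a, b
```

A caller would pass a single-channel RAW plane to `low_texture_patches` and feed the returned corners to `fit_noise_profile`. The fit is only meaningful when the selected patches span a range of brightness levels; if all flat patches share nearly the same mean, the slope `a` is poorly determined.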