Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs

Georgy Perevozchikov; Nancy Mehta; Mahmoud Afifi; Radu Timofte

TL;DR

This work tackles the challenge of generalizing neural-based camera ISPs to unseen devices by proposing Rawformer, a fully unsupervised Transformer-based raw-to-raw translator that operates within a CycleGAN framework. Rawformer combines a contextual-scale aware encoder/decoder built on a condensed query attention mechanism, a multi-scale SPFN (scale perceptive feed-forward network), and a cross-domain discriminator with training-time caching, and it translates raw sensor data across cameras without paired raw-sRGB data. The approach achieves state-of-the-art performance on real camera datasets, which allows pre-trained neural ISPs to be reused and reduces data collection costs, and it shows robustness in low-light conditions and across DSLR-mobile mappings. Its practical significance lies in enabling scalable deployment of learnable ISPs across diverse devices; real-time CPU performance and lighter models are left as future work.

Abstract

Modern smartphone camera quality heavily relies on the image signal processor (ISP) to enhance captured raw images, utilizing carefully designed modules to produce final output images encoded in a standard color space (e.g., sRGB). Neural-based end-to-end learnable ISPs offer promising advancements, potentially replacing traditional ISPs with their ability to adapt without requiring extensive tuning for each new camera model, as is often the case for nearly every module in traditional ISPs. However, the key challenge with recent learning-based ISPs is the need to collect large paired datasets for each distinct camera model, because intrinsic camera characteristics influence the formation of input raw images. This paper tackles this challenge by introducing a novel method for unpaired learning of raw-to-raw translation across diverse cameras. Specifically, we propose Rawformer, an unsupervised Transformer-based encoder-decoder method for raw-to-raw translation. It accurately maps raw images captured by a certain camera to the target camera, facilitating the generalization of learnable ISPs to new unseen cameras. Our method demonstrates superior performance on real camera datasets, achieving higher accuracy than previous state-of-the-art techniques and preserving a more robust correlation between the original and translated raw images. The code and the pretrained models are available at https://github.com/gosha20777/rawformer.
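
The unpaired training idea referred to above (a CycleGAN framework with a cycle consistency loss) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the L1 distance, the packed four-channel raw layout, and the toy per-channel gain "generators" are hypothetical stand-ins and are not taken from the Rawformer code. The translated images (written $A^t$, $B^t$ in the figure captions) would be scored by a discriminator, while the cycled images ($A^c$, $B^c$) are compared with the originals.

```python
import numpy as np


def l1(x, y):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(x - y)))


def cycle_terms(a, b, g_ab, g_ba):
    """Compute translated images, cycled images, and a cycle-consistency loss.

    a, b : unpaired raw batches from camera A and camera B
    g_ab : callable mapping A's raw space to B's (hypothetical stand-in)
    g_ba : callable mapping B's raw space to A's (hypothetical stand-in)
    """
    a_t = g_ab(a)    # A^t: A translated into B's space (input to a discriminator)
    b_t = g_ba(b)    # B^t: B translated into A's space
    a_c = g_ba(a_t)  # A^c: A cycled back to its own space
    b_c = g_ab(b_t)  # B^c: B cycled back to its own space
    loss = l1(a_c, a) + l1(b_c, b)  # L1 is an assumption, not the paper's stated choice
    return a_t, b_t, a_c, b_c, loss


# Toy data: batches of packed 4-channel raw images (assumed layout), values in [0, 1).
rng = np.random.default_rng(0)
a = rng.random((2, 4, 8, 8))
b = rng.random((2, 4, 8, 8))

# Toy "generators": per-channel gains standing in for learned networks.
gains = np.array([2.0, 1.0, 1.0, 0.5]).reshape(1, 4, 1, 1)


def g_ab(x):
    return x * gains


def g_ba(x):
    return x / gains


def identity(x):
    return x


# When g_ba inverts g_ab, the cycled images match the originals.
a_t, b_t, a_c, b_c, loss_inverse = cycle_terms(a, b, g_ab, g_ba)

# When the backward mapping does not invert the forward one, the loss is positive.
_, _, _, _, loss_mismatch = cycle_terms(a, b, g_ab, identity)
```

In this toy setting the loss vanishes only when the two mappings invert each other, which is the constraint that lets training proceed without paired images from the two cameras.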

Paper Structure (31 sections, 8 equations, 11 figures, 15 tables)

Introduction
Related Work
Neural-Based ISP
Image-to-Image Domain Adaptation
Proposed Method
Generator's Encoder Network
Condensed Query Attention Block
Scale Perceptive Feed-forward Network
Composite Downsampler
Style Modulator
Generator's Decoder Network
Cross-Domain Attention-Guided Discriminator
Self-Supervised Pre-Training
Experiments
Datasets
...and 16 more sections

Figures (11)

Figure 1: We introduce Rawformer, an unsupervised method for raw-to-raw translation that allows the utilization of pre-trained neural-based ISPs to process raw images captured by previously unseen cameras. Shown are raw and sRGB images processed by a neural-based ISP [wirzberger2022lan]. a) Raw image from an iPhone X rendered by the neural-based ISP trained on iPhone X's raw images. b) Raw image from an iPhone X rendered by the neural-based ISP trained on Samsung S9's raw images. c) iPhone X raw image translated to Samsung S9's raw space using our method, then processed by the Samsung S9 neural-based ISP. d) Raw image captured by Samsung S9's camera rendered by its native camera ISP, provided as a reference for visual comparison.
Figure 1: Overview of the proposed architecture and training flow. $A^t$ and $B^t$ refer to translated images used by the discriminator loss, while $A^c$ and $B^c$ refer to the produced images used by the cycle consistency loss.
Figure 2: a) Overview of the generator architecture of Rawformer. The primary components of the generator: b) contextual-scale aware downsampler block (CSAD), c) condensed query attention block (CQA), d) scale perceptive feed-forward network (SPFN), e) composite downsampling block (CDown), f) composite upsampler block (CUp), and g) contextual-scale aware upsampler block (CSAU).
Figure 2: Details of the proposed style modulation process.
Figure 3: a) The proposed cross-domain attention-guided discriminator. b) Feature map visualizations without (w/o) and with (w) the discriminator head. The inclusion of the discriminator head aids in refining the overall results, and the discriminator training (as shown in c) benefits from the inclusion of the discriminator head, particularly with the cache ($M$) component.
...and 6 more figures
