
RawGen: Learning Camera Raw Image Generation

Dongyoung Kim, Junyong Lee, Abhijith Punnappurath, Mahmoud Afifi, Sangmin Han, Alex Levinshtein, Michael S. Brown

Abstract

Cameras capture scene-referred linear raw images, which are processed by onboard image signal processors (ISPs) into display-referred 8-bit sRGB outputs. Although raw data is more faithful for low-level vision tasks, collecting large-scale raw datasets remains a major bottleneck, as existing datasets are limited and tied to specific camera hardware. Generative models offer a promising way to address this scarcity; however, existing diffusion frameworks are designed to synthesize photo-finished sRGB images rather than physically consistent linear representations. This paper presents RawGen, to our knowledge the first diffusion-based framework enabling text-to-raw generation for arbitrary target cameras, alongside sRGB-to-raw inversion. RawGen leverages the generative priors of large-scale sRGB diffusion models to synthesize physically meaningful linear outputs, such as CIE XYZ or camera-specific raw representations, via specialized processing in the latent and pixel spaces. To handle the unknown and diverse ISP pipelines and photo-finishing effects present in diffusion-model training data, we build a many-to-one inverse-ISP dataset in which multiple sRGB renditions of the same scene, generated with diverse ISP parameters, are anchored to a common scene-referred target. Fine-tuning a conditional denoiser and a specialized decoder on this dataset allows RawGen to produce camera-centric linear reconstructions that effectively invert the rendering pipeline. We demonstrate RawGen's superior performance over traditional inverse-ISP methods that assume a fixed ISP. Furthermore, we show that augmenting training pipelines with RawGen's scalable, text-driven synthetic data benefits downstream low-level vision tasks.
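The final stage of this pipeline maps the decoded linear CIE XYZ image into a target camera's raw space using calibration metadata. As a rough sketch of that step, the snippet below applies the XYZ-to-camera color matrix that DNG files carry in their ColorMatrix tag; the function name, NumPy formulation, and clipping are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def xyz_to_camera_raw(xyz_image: np.ndarray, color_matrix: np.ndarray) -> np.ndarray:
    """Map a linear CIE XYZ image into a camera's raw color space.

    xyz_image:    (H, W, 3) float array of linear CIE XYZ values,
                  e.g. a decoded generation.
    color_matrix: (3, 3) XYZ-to-camera matrix, e.g. the ColorMatrix
                  tag read from a single DNG file of the target camera.
    """
    h, w, _ = xyz_image.shape
    # Apply the 3x3 matrix per pixel: raw = M @ xyz.
    raw = xyz_image.reshape(-1, 3) @ color_matrix.T
    # Clipping to [0, 1] is an assumption; a full pipeline would also
    # account for white balance and sensor-specific scaling.
    return np.clip(raw, 0.0, 1.0).reshape(h, w, 3)
```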


Paper Structure

This paper contains 46 sections, 12 equations, 20 figures, and 4 tables.

Figures (20)

  • Figure 1: We present RawGen, a diffusion-based method for generating realistic camera raw images. RawGen produces a latent representation of linear CIE XYZ images, conditioned on an sRGB image or a text prompt. The latent is decoded to CIE XYZ and mapped to arbitrary target camera raw spaces.
  • Figure 2: Overview of the RawGen framework. During training, raw images are converted to CIE XYZ and sRGB representations to (A) fine-tune the DiT to denoise CIE XYZ latents conditioned on an sRGB image, and (B) fine-tune the VAE decoder to reconstruct CIE XYZ images. During inference (C), either an sRGB image (image-to-raw, I2R) or a text prompt (text-to-raw, T2R) is used to condition the DiT to generate a CIE XYZ latent, which is decoded to obtain a CIE XYZ image and subsequently mapped to the target camera's raw space using its calibration metadata, which can be easily acquired from a single DNG file of the target camera.
  • Figure 3: Latent-space compactness of sRGB-to-XYZ conversion methods. 100 color-graded sRGB variants per prompt are converted to XYZ and encoded into the VAE latent space; a lower mean L2 distance to the centroid indicates a more compact cluster (a minimal sketch of this metric follows the figure list).
  • Figure 4: CIE XYZ reconstruction results. Shown is an sRGB input image rendered with different rendering preferences (Experts A--E) and the corresponding CIE XYZ reconstructions produced by Raw-Diffusion [reinders2025raw], CIE XYZ Net [afifi2021cie], and our RawGen. The last row shows the ground-truth CIE XYZ image.
  • Figure 5: t-SNE visualization of VAE latent spaces for each sRGB-to-XYZ conversion method. Dashed contours indicate the estimated density region per method; RawGen produces the most compact cluster.
  • ...and 15 more figures
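The compactness measure in Figure 3 reduces to the mean L2 distance of latent codes to their centroid. Below is a minimal NumPy sketch, assuming the sRGB variants have already been converted to XYZ and encoded by the VAE into an array of latents; the function name and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def latent_compactness(latents: np.ndarray) -> float:
    """Mean L2 distance of latent codes to their centroid.

    latents: (N, ...) array of VAE latents, one per sRGB variant of
             the same prompt (e.g. N = 100 color-graded renditions).
             Lower values indicate a more compact, rendering-invariant
             cluster.
    """
    flat = latents.reshape(len(latents), -1)      # flatten to (N, D)
    centroid = flat.mean(axis=0, keepdims=True)   # cluster center (1, D)
    return float(np.linalg.norm(flat - centroid, axis=1).mean())
```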