RAWMamba: Unified sRGB-to-RAW De-rendering With State Space Model

Hongjun Chen, Wencheng Han, Huan Zheng, Jianbing Shen

TL;DR

The core of RAWMamba is the Unified Metadata Embedding (UME) module, which harmonizes diverse metadata types into a unified representation. A multi-perspective affinity modeling method is used within it to promote the extraction of reference information.

Abstract

Recent advancements in sRGB-to-RAW de-rendering have increasingly emphasized metadata-driven approaches to reconstruct RAW data from sRGB images, supplemented by partial RAW information. In image-based de-rendering, metadata is commonly obtained through sampling, whereas in video tasks, it is typically derived from the initial frame. The distinct metadata requirements necessitate specialized network architectures, leading to architectural incompatibilities that increase deployment complexity. In this paper, we propose RAWMamba, a Mamba-based unified framework developed for sRGB-to-RAW de-rendering across both image and video domains. The core of RAWMamba is the Unified Metadata Embedding (UME) module, which harmonizes diverse metadata types into a unified representation. In detail, a multi-perspective affinity modeling method is proposed to promote the extraction of reference information. In addition, we introduce the Local Tone-Aware Mamba (LTA-Mamba) module, which captures long-range dependencies to enable effective global propagation of metadata. Experimental results demonstrate that the proposed RAWMamba achieves state-of-the-art performance, yielding high-quality RAW data reconstruction.

Paper Structure

This paper contains 15 sections, 18 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison with Previous sRGB-to-RAW De-rendering Approaches. (a) In image de-rendering methods such as [spatiallyawarecam, li2023metadata], sampled RAW data is used as metadata, followed by spatial recovery operations to extract information. (b) In video de-rendering methods such as [videoraw], the first frame serves as metadata, and a sequential model is applied for information extraction. (c) Our RAWMamba method presents a unified framework capable of handling both image and video inputs.
  • Figure 2: Overview of the Proposed RAWMamba. In RAWMamba, inputs are processed through the UME module and the main network. The UME module encodes images and metadata into feature spaces, generating global and local metadata embeddings via affinity-based blocks (GEB and LEB). The main network queries the metadata embeddings to refine features, which are then aggregated and fed into the LTA-Mamba module. LTA-Mamba employs local and global spatiotemporal scanning to enhance global consistency for the final reconstruction.
  • Figure 3: The details of the LTA-Mamba module. The LTA-Mamba module relies on two consecutive Mamba blocks. In the figure, the DWConv block represents a depth-wise convolution layer, while CA denotes a channel attention block. (a) The implementation of the bi-directional Mamba block. (b) An illustration of the global scan strategy and the local scan strategy.
  • Figure 4: Visual comparisons on the image dataset CAM and the video dataset RVD-part2. As shown in this figure, our unified model achieves the highest accuracy and lowest error compared to other methods on both image and video datasets.
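
The caption of Figure 3 mentions a global scan, a local scan, and a bi-directional Mamba block, but this page does not give their details. The sketch below is only an illustration of how such scan orders are commonly built for a 2D feature map, not the authors' implementation. It assumes that the global scan is a row-major traversal of the whole grid, that the local scan traverses non-overlapping windows one at a time, and that "bi-directional" means running the sequence in forward and reversed order. The function names and the `window` parameter are hypothetical.

```python
def global_scan_order(height, width):
    """Row-major traversal of the whole grid (assumed form of a global scan)."""
    return [r * width + c for r in range(height) for c in range(width)]


def local_scan_order(height, width, window):
    """Visit the grid one window at a time, row-major inside each window.

    Assumed form of a local scan; requires height and width to be
    divisible by `window` to keep the sketch simple.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for top in range(0, height, window):          # window rows
        for left in range(0, width, window):      # window columns
            for r in range(top, top + window):    # rows inside the window
                for c in range(left, left + window):
                    order.append(r * width + c)   # flat index of pixel (r, c)
    return order


def bidirectional(order):
    """Return the forward and reversed versions of a scan order."""
    return list(order), list(reversed(order))


if __name__ == "__main__":
    # On a 4x4 grid with 2x2 windows, neighbours inside a window stay adjacent.
    print(local_scan_order(4, 4, 2))
    # [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Under these assumptions, the local order keeps spatially close pixels next to each other in the sequence, while the global order spans the full grid. A sequence model would read the features gathered in each order and scatter the outputs back to their original positions.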