Event Camera Demosaicing via Swin Transformer and Pixel-focus Loss

Yunfan Lu, Yijie Xu, Wenzong Ma, Weiyu Guo, Hui Xiong

TL;DR

This paper tackles RAW-domain demosaicing for event cameras, whose sensor design leaves some pixel positions without values and so breaks the assumptions of conventional RAW processing. It introduces a Swin-Transformer backbone with space-to-depth preprocessing and a U-Net–style encoder–decoder, plus a two-stage training strategy that first trains with Charbonnier loss and then fine-tunes with the Pixel-focus Loss to emphasize edge regions. The Pixel-focus Loss has two forms, $\mathcal{L}_{pf}^p$ (power function) and $\mathcal{L}_{pf}^e$ (exponential function), which address the long-tailed error distribution and improve convergence. Evaluations on the MIPI Demosaic Challenge dataset show improved reconstruction quality over baselines such as RSTCANet, and the authors release code and trained models to support adoption and further RAW-domain research.
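As a point of reference for the first training stage, the standard Charbonnier penalty is $\sqrt{d^2 + \epsilon^2}$, where $d$ is the per-pixel difference between prediction and ground truth. The sketch below is a minimal pure-Python illustration of that standard formula, not the authors' implementation. The function name `charbonnier_loss` and the default `eps=1e-3` are assumptions made for this example; the summary does not state the value used in the paper, and the Pixel-focus Loss is not sketched because its formula is not given here.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over paired pixel values.

    `eps` is an illustrative default (assumption), not a value from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for p, t in zip(pred, target):
        d = p - t
        total += math.sqrt(d * d + eps * eps)
    return total / len(pred)


# Usage: a flat region (tiny errors) versus an edge region (one large error).
flat_region = charbonnier_loss([0.50, 0.51], [0.50, 0.50])
edge_region = charbonnier_loss([0.50, 0.90], [0.50, 0.50])
print(flat_region < edge_region)
```

Because the penalty behaves like a smoothed absolute difference, the many near-zero errors in flat regions each contribute roughly `eps`, while the few large edge errors dominate the mean. This is the kind of long-tailed error distribution that the second-stage fine-tuning is described as targeting.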

Abstract

Recent research has highlighted improvements in high-quality imaging guided by event cameras, with most of these efforts concentrating on the RGB domain. However, these advancements frequently neglect the unique challenges introduced by the inherent flaws in the sensor design of event cameras in the RAW domain. Specifically, this sensor design results in the partial loss of pixel values, posing new challenges for RAW domain processes like demosaicing. The challenge intensifies because most research in the RAW domain assumes that every pixel contains a value, which makes straightforward adaptation of these methods to event camera demosaicing problematic. To this end, we present a Swin-Transformer-based backbone and a pixel-focus loss function for demosaicing with missing pixel values in RAW domain processing. Our core motivation is to refine a general and widely applicable foundational model from the RGB domain for RAW domain processing, thereby broadening the model's applicability within the entire imaging process. Our method harnesses multi-scale processing and space-to-depth techniques to ensure efficiency and reduce computational complexity. We also propose the Pixel-focus Loss function for network fine-tuning to improve network convergence, based on our discovery of a long-tailed distribution in training loss. Our method has been validated on the MIPI Demosaic Challenge dataset, with subsequent analytical experiments confirming its efficacy. All code and trained models are released here: https://github.com/yunfanLu/ev-demosaic
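The space-to-depth operation mentioned above can be illustrated with a small sketch. It rearranges each `block` x `block` tile of a single-channel RAW grid into one spatial position with `block * block` channels, which shrinks the spatial size the transformer has to attend over. This is a generic pure-Python illustration; the function name `space_to_depth` and the block size of 2 are assumptions for the example, and the block size the authors use for their sensor pattern is not stated in this summary.

```python
def space_to_depth(image, block=2):
    """Pack each block x block tile of an H x W grid into one cell of channels.

    Returns an (H // block) x (W // block) grid whose cells are lists of
    block * block values in row-major tile order. `block=2` is illustrative.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("image must be non-empty")
    if height % block or width % block:
        raise ValueError("image dimensions must be divisible by block")
    packed = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            # Gather the tile whose top-left corner is (i, j).
            cell = [image[i + di][j + dj]
                    for di in range(block)
                    for dj in range(block)]
            row.append(cell)
        packed.append(row)
    return packed


# Usage: a 4 x 4 single-channel grid becomes a 2 x 2 grid with 4 channels.
raw = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
packed = space_to_depth(raw)
print(len(packed), len(packed[0]), len(packed[0][0]))
```

No values are discarded by this rearrangement, so positions that carry no RGB signal in the RAW input simply end up in fixed channel slots of each cell.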

Paper Structure (15 sections, 4 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Contemporary design of an actual event camera sensor (Hybridevs sensor), featuring red, green, and blue pixels for outputting RGB RAW signals. Black pixels in the lower right corner of the green and red areas are designated for event signal output, and white pixels do not emit any signals. The demosaicing task aims to convert a RAW image with RGB signals and black holes (a) into a full-color image with three RGB channels (b).
  • Figure 2: Visual results of two images at different stages of training. Example (I) displays an image with few edges and little texture, featuring extensive areas of sky and lake, while example (II) presents an image rich in edges and texture, including animal fur and splashing water. For these two examples, four groups of reconstruction results are shown under varying PSNR values, along with difference maps and difference distributions. Here, "difference" refers to the absolute value of discrepancies compared to the ground truth. As PSNR increases, the differences exhibit a long-tailed distribution.
  • Figure 3: Overview of the event camera demosaicing method. The input RAW image is first preprocessed using space-to-depth and 1$\times$1 convolution operations. The encoder then extracts multi-scale features using Swin Transformer blocks with the shifted window mechanism. The decoder mirrors the encoder's structure and incorporates skip connections to recover spatial details. Finally, the reconstruction module generates the output RGB image. (a) Encoder block architecture. (b) Shifted window mechanism for cross-window interactions. (c) Decoder block architecture.
  • Figure 4: Loss function visualization. (a), (b), and (c) show the Charbonnier loss and the pixel-focus loss with the power and the exponential function, respectively. The line charts plot each loss function and its gradient over the 0 to 1 range. A magnified view of the 0 to 0.1 interval makes it easier to compare how the loss functions behave on long-tailed distributions.
  • Figure 5: Visual comparison of our method with RSTCANet [xing2022residual]. Comparison methods 1, 2, and 3 denote, respectively, processing directly on the original RAW, processing after converting the original RAW into a Bayer pattern, and the results after fine-tuning RSTCANet [xing2022residual].
  • ...and 2 more figures