Raw or Cooked? Object Detection on RAW Images

William Ljungbergh, Joakim Johnander, Christoffer Petersson, Michael Felsberg

TL;DR

The paper challenges the assumption that camera ISP pipelines optimized for visually pleasing RGB images are optimal for deep vision tasks. It proposes a Bayer-pattern-preserving downsampling stage plus three lightweight, learnable RAW processing operations ($F_\gamma$, $F_{erf}$, $F_{YJ}$) trained end-to-end with an object detector, and demonstrates improvements on the PASCALRAW dataset. Notably, the Learnable Yeo-Johnson operation ($F_{YJ}$) achieves the highest accuracy, $AP=52.6$, surpassing the RGB baseline by about $2.1$ AP points, while the naïve RAW RGGB input performs much worse. These results indicate that task-driven optimization of RAW-to-feature transformations can improve object detection, particularly under challenging lighting conditions, with implications for camera pipelines and low-light vision systems.

Abstract

Images fed to a deep neural network have in general undergone several handcrafted image signal processing (ISP) operations, all of which have been optimized to produce visually pleasing images. In this work, we investigate the hypothesis that the intermediate representation of visually pleasing images is sub-optimal for downstream computer vision tasks compared to the RAW image representation. We suggest that the operations of the ISP instead should be optimized towards the end task, by learning the parameters of the operations jointly during training. We extend previous works on this topic and propose a new learnable operation that enables an object detector to achieve superior performance when compared to both previous works and traditional RGB images. In experiments on the open PASCALRAW dataset, we empirically confirm our hypothesis.

Paper Structure (14 sections, 6 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Three qualitative examples from the PASCALRAW dataset. We show the ground-truth (top), the RGB baseline detector (center), and the RAW RGGB detector with a learnable Yeo-Johnson operation (bottom). Compared to the RGB baseline, our proposed RAW RGGB detector manages to detect objects subject to poor light conditions.
  • Figure 2: Downsampling method for Bayer-pattern RAW data. Each of the colors in the filter array of the downsampled RAW image (right) is the average over all cells in the corresponding region in the original image with the same color (left and center). The figure illustrates the downsampling of an original image patch of size $2d\times2d$ (with $d=5$ in this example), down to a patch of size $2\times2$, i.e. with a downsampling factor $d$ in each dimension.
  • Figure 3: Traditional (A), naïve (B), and proposed (C) detection pipelines. The traditional pipeline uses a set of common image signal processing operations, such as Demosaicing, Denoising, and Tonemapping, and then feeds the object detector with the processed RGB images. The naïve pipeline feeds the RAW image directly into the detector while our proposed pipeline first feeds the RAW image through a learnable non-linear operation, $F$, which can be viewed as being part of the end-to-end trainable object detection network.
  • Figure 4: Evolution of the learnable parameter $\lambda$ during the entire training (top-right), the distribution of the RAW pixel values in PASCALRAW (bottom-right), and the functional form -- before and after training -- of the Learnable Yeo-Johnson operation (left). In the left plot, the output activation values are shown across the full input range $[0, 2^{12}-1]$.
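
The downsampling described in the Figure 2 caption and the Yeo-Johnson operation referred to in Figure 4 can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bayer_downsample` and `yeo_johnson` are hypothetical, the RAW image is assumed to be a single-channel 2-D mosaic whose height and width are divisible by $2d$, and the transform is the standard Yeo-Johnson power transform restricted to non-negative inputs (RAW pixel values lie in $[0, 2^{12}-1]$). Any input normalization or parameterization used in the paper is not given in this section and is not reproduced here; in the paper $\lambda$ is learned jointly with the detector, whereas here it is a plain argument.

```python
import numpy as np


def bayer_downsample(raw: np.ndarray, d: int) -> np.ndarray:
    """Downsample a Bayer mosaic by a factor d per dimension.

    Each 2d x 2d input patch becomes one 2 x 2 output patch, where every
    output cell is the mean of the d*d same-colour cells of the input patch.
    Hypothetical helper; assumes H and W are divisible by 2*d.
    """
    h, w = raw.shape
    if h % (2 * d) or w % (2 * d):
        raise ValueError("height and width must be divisible by 2*d")
    out = np.empty((h // d, w // d), dtype=np.float64)
    for r in (0, 1):          # row offset inside the 2x2 colour filter cell
        for c in (0, 1):      # column offset inside the 2x2 colour filter cell
            plane = raw[r::2, c::2].astype(np.float64)   # one colour plane
            ph, pw = plane.shape
            blocks = plane.reshape(ph // d, d, pw // d, d)
            out[r::2, c::2] = blocks.mean(axis=(1, 3))   # average d x d blocks
    return out


def yeo_johnson(x, lam: float) -> np.ndarray:
    """Standard Yeo-Johnson transform for non-negative inputs (assumption)."""
    x = np.asarray(x, dtype=np.float64)
    if abs(lam) < 1e-8:
        return np.log1p(x)                       # limit as lam -> 0
    return (np.power(x + 1.0, lam) - 1.0) / lam


# Usage: a 10 x 10 patch (d = 5) reduces to a single 2 x 2 Bayer cell.
patch = np.random.default_rng(0).integers(0, 2**12, size=(10, 10))
small = bayer_downsample(patch, d=5)
features = yeo_johnson(small, lam=0.5)
```

Averaging within each colour plane separately keeps the output a valid Bayer mosaic, which is why the result can still be passed to a pointwise operation such as the one sketched in `yeo_johnson`.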