Efficient Visual Computing with Camera RAW Snapshots

Zhihao Li; Ming Lu; Xu Zhang; Xin Feng; M. Salman Asif; Zhan Ma

TL;DR

This work introduces a RAW-domain vision paradigm, ρ-Vision, that bypasses traditional ISP processing to perform object detection and image compression directly on camera RAW data. It trains an Unpaired CycleR2R network, which learns ISP and inverse ISP (invISP) models from unpaired RAW and RGB images, to generate realistic simulated RAW (simRAW) data from RGB sources; models trained in the RGB domain are then fine-tuned on simRAW to operate in the RAW domain. Empirical results across detection, classification, and segmentation demonstrate consistent RAW-domain advantages in accuracy, latency, and efficiency, with notable robustness under low-light and HDR conditions. Removing the ISP also reduces computation and memory demands and enables hardware pipelines that avoid ISP-induced bottlenecks. Overall, ρ-Vision offers a practical path toward faster, more efficient visual computing on diverse sensors without reliance on specialized ISP configurations.

Abstract

Conventional cameras capture image irradiance on a sensor and convert it to RGB images using an image signal processor (ISP). The images can then be used for photography or visual computing tasks in a variety of applications, such as public safety surveillance and autonomous driving. One can argue that since RAW images contain all the captured information, the conversion of RAW to RGB using an ISP is not necessary for visual computing. In this paper, we propose a novel $\rho$-Vision framework to perform high-level semantic understanding and low-level compression using RAW images, without the ISP subsystem that has been used for decades. Considering the scarcity of available RAW image datasets, we first develop an unpaired CycleR2R network based on unsupervised CycleGAN to train modular unrolled ISP and inverse ISP (invISP) models using unpaired RAW and RGB images. We can then flexibly generate simulated RAW images (simRAW) from any existing RGB image dataset and finetune models originally trained for the RGB domain to process real-world camera RAW images. We demonstrate object detection and image compression in the RAW domain using RAW-domain YOLOv3 and a RAW image compressor (RIC) on snapshots from various cameras. Quantitative results reveal that RAW-domain task inference provides better detection accuracy and compression performance than RGB-domain processing. Furthermore, the proposed $\rho$-Vision generalizes across various camera sensors and different task-specific models. An additional advantage of eliminating the ISP is the potential reduction in computation and processing time.
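To make the RGB-to-simRAW idea concrete, the following is a minimal sketch of a hand-crafted inverse ISP, not the learned invISP of the paper. It assumes an sRGB input in [0, 1], an RGGB Bayer layout, and hypothetical white-balance gains (`wb_gains`); it omits the color correction matrix, noise modeling, and everything that Unpaired CycleR2R learns from data. The function names are illustrative only.

```python
import numpy as np


def srgb_to_linear(rgb):
    """Invert the standard sRGB transfer curve for values in [0, 1]."""
    rgb = np.clip(rgb, 0.0, 1.0)
    return np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)


def rgb_to_sim_raw(rgb, wb_gains=(2.0, 1.0, 1.5)):
    """Toy inverse ISP: sRGB image (H, W, 3) -> single-channel RGGB mosaic (H, W).

    wb_gains are hypothetical per-channel (R, G, B) white-balance gains;
    a real camera would supply its own values.
    """
    linear = srgb_to_linear(rgb.astype(np.float64))   # undo gamma encoding
    linear = linear / np.asarray(wb_gains)            # undo white balance
    h, w, _ = linear.shape
    raw = np.empty((h, w), dtype=np.float64)
    # Re-mosaic: keep one color sample per pixel following the RGGB pattern.
    raw[0::2, 0::2] = linear[0::2, 0::2, 0]  # R
    raw[0::2, 1::2] = linear[0::2, 1::2, 1]  # G
    raw[1::2, 0::2] = linear[1::2, 0::2, 1]  # G
    raw[1::2, 1::2] = linear[1::2, 1::2, 2]  # B
    return raw
```

Applied to every image of a labeled RGB dataset, a conversion of this kind would let the existing labels be reused for RAW-domain finetuning; the paper replaces such fixed steps with ISP and invISP models trained on unpaired RAW and RGB images.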

Paper Structure (19 sections, 9 equations, 11 figures, 6 tables)

A Real-World Hardware Implementation
Hardware System for Comparative Benchmark
Experimental Analysis
Details of the Unpaired CycleR2R
Architecture of Basic Neural Network
Architectures of Discriminators
Gamma Correction Standard
Details of Distribution Analysis of RAW images
Proof of Equation \ref{eq:var_gradient_weights}
Proof of Equation \ref{eq:simp_var_gradient_weights}
RAW-domain Classification
Datasets and Baselines
Comparative Studies of RAW-domain Classification
RAW-domain Segmentation
Datasets
...and 4 more sections

Figures (11)

Figure S1: RGB-Vision vs. $\rho$-Vision. (a) The hardware system uses an AX620A AI SoC, with a UC96B power meter connected for measurement; (b) The $\rho$-Vision framework trains and tests models on RAW images directly, completely bypassing the ISP; (c) The traditional RGB-Vision framework requires the ISP to generate RGB images for model training and testing; (d) Average gains of $\rho$-Vision over RGB-Vision. Metrics are normalized to the results generated by the RGB-Vision pipeline.
Figure S2: Impact of the ISP used in RGB-Vision on the detection task. The notation "Training ISP$\rightarrow$Testing ISP" gives the ISP used to generate RGB images for training and the ISP used to generate RGB images for testing, respectively. Default ISP parameters are marked with "(D)" and expert-tuned ISP parameters with "(T)". The first two columns illustrate domain discrepancies when training and testing use different ISPs, while the last two columns demonstrate how ISP quality (with expert tuning) affects object detection accuracy. Zoom in for details.
Figure S3: Visualization of Classifier Response to Noisy and Clean Inputs. The "RGB" rows show processing with the Anscombe ISP [diamond2021dirty], whose RGB output is fed to the classifier; in contrast, the "RAW" rows show processing with Unpaired CycleR2R, where the RAW images are classified directly. Noisy samples are formed by adding noise to the clean inputs. The "Noise Channel" is the feature channel in the shallow layer "Conv2d_0" that shows the maximum difference between the noisy and clean inputs. The Grad-CAM [selvaraju2017grad] visualizations are based on the last convolutional layer "Conv2d_13_pointwise". Comparing the "Noise Channel" across inputs reveals that the RAW-domain classifier is adept at extracting noise patterns and separating noise from the signal, so its Grad-CAM visualizations for noisy inputs closely resemble those for the clean input. The RGB-domain classifier, by contrast, struggles to disentangle noise from the signal because of the complex non-linear processing in the Anscombe ISP, leading to significant deviations in Grad-CAM under noisy conditions and consequently to misclassification.
Figure S4: Qualitative Visualization of Pretrained RAW Segmentation Model. Example predictions show better recognition of buildings, sky, and traffic lights by our Unpaired CycleR2R on Cityscapes RGB $\rightarrow$ iPhone RAW. Gamma correction and brightness adjustment have been applied to RAW images for a better view.
Figure S5: Few-shot finetuning using limited camera RAWs. The simRAW-pretrained HRNetv2 [wang2020deep] is first trained on samples in simRAW$_\text{c}$ generated by our Unpaired CycleR2R and then finetuned using a limited number of camera RAW images; the "scratch" model is randomly initialized and then trained using the same number of labeled real RAW images.
...and 6 more figures
