Learning deep illumination-robust features from multispectral filter array images

Anis Amziane

TL;DR

Learning discriminant and illumination-robust features directly from raw MSFA images outperforms both handcrafted and recent deep learning-based methods on MS image classification, while also requiring significantly less computational effort.

Abstract

Multispectral (MS) snapshot cameras equipped with an MS filter array (MSFA) capture multiple spectral bands in a single shot, resulting in a raw mosaic image where each pixel holds only one channel value. The fully-defined MS image is estimated from the raw one through demosaicing, which inevitably introduces spatio-spectral artifacts. Moreover, training on fully-defined MS images can be computationally intensive, particularly with deep neural networks (DNNs), and may result in features that lack discriminative power due to suboptimal learning of spatio-spectral interactions. Furthermore, outdoor MS image acquisition occurs under varying lighting conditions, leading to illumination-dependent features. This paper presents an original approach to learn discriminant and illumination-robust features directly from raw images. It involves three components: raw spectral constancy to mitigate the impact of illumination, MSFA-preserving transformations suited for raw image augmentation to train DNNs on diverse raw textures, and raw-mixing to capture discriminant spatio-spectral interactions in raw images. Experiments on MS image classification show that our approach outperforms both handcrafted and recent deep learning-based methods, while also requiring significantly less computational effort. The source code is available at https://github.com/AnisAmziane/RawTexture.

Paper Structure (18 sections, 7 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Fully-defined (a) vs. proposed (b) raw feature extraction pipeline. (a) requires estimating the fully-defined image first before post-processing and feature extraction, while (b) directly exploits the acquired raw image.
  • Figure 2: MSFA-based snapshot imaging: shown is the IMEC NIR $5\times5$ MSFA ($\lambda^{b}\in\{678\,nm,~\dots,~960\,nm\}$, $b\in[\![0, 24]\!]$), composed of $B^{2}=25$ optical filters arranged in a $5\times5$ square basic pattern. It samples the scene radiance according to the spectral sensitivity function (SSF) $S^{b}(\lambda)$ of each filter, and provides a raw image whose pixels each contain a single channel value.
  • Figure 3: Examples of MSFA-preserving augmentations. The direct vertical flip in (a) provides an augmented raw image with a flipped basic pattern, while our proposed vertical flip (b) preserves the pattern structure, which is crucial for learning from raw images. (c) and (e) are translations along the y- and x-axes, respectively. (d) is the horizontal flip and (f) the texture remodeling augmentation. Basic patterns are enlarged for better visualization.
  • Figure 4: RawMixer architecture. $\Phi_{mixer}$ learns deep spatio-spectral interactions guided by the MSFA basic pattern. $\Phi_t$ is the (positional-encoding-free) transformer encoder that takes in the $320$ feature maps of size $\lfloor\frac{m}{2}\rfloor \times \lfloor\frac{m}{2}\rfloor$ provided by $\Phi_{mixer}$, reshaped as $\lfloor\frac{m}{2}\rfloor^{2}$ tokens $\times\,320$ features. It learns another embedding through self-attention and feed-forward layers. The depth of the ConvMixer and transformer encoders is set to 2 in our experiments. SELU: scaled exponential linear unit, BN: batch normalization, FC: fully-connected layer. Filter depths (1 in the raw conv. layer, 320 for the ConvMixer layer) are not shown for the sake of clarity.
  • Figure 5: Considered MSFAs: (a) IMEC VIS $4\times4$ ($\lambda^{b}\in\{469\,nm,~\dots,~633\,nm\}$, $b\in[\![0, 15]\!]$), (b) NIR $5\times5$ ($\lambda^{b}\in\{678\,nm,~\dots,~960\,nm\}$, $b\in[\![0, 24]\!]$), and (c) VIS-NIR $2\times2$ ($\lambda^{b}\in\{465\,nm,~\dots,~811\,nm\}$, $b\in[\![0, 3]\!]$).
  • ...and 2 more figures
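
The captions of Figures 2 and 3 describe how an MSFA assigns a single band to each pixel and why a direct flip of the raw image breaks the basic pattern. The NumPy sketch below illustrates both ideas under stated assumptions; it is not taken from the authors' repository. The function names are hypothetical, the row-major band numbering inside the basic pattern is an assumption (the real IMEC filter layout may differ), and the block-wise flip shown here is only one possible way to keep the basic pattern in place, not necessarily the augmentation used in the paper.

```python
import numpy as np


def msfa_band_index(height, width, b):
    """Band index seen by each pixel when a b x b basic pattern is tiled.

    Assumption: bands are numbered row-major inside the basic pattern.
    """
    rows = np.arange(height) % b
    cols = np.arange(width) % b
    return rows[:, None] * b + cols[None, :]


def mosaic(cube, b):
    """Sample a raw image from a fully-defined (H, W, b*b) cube.

    Each pixel keeps only the value of the band selected by the MSFA.
    """
    h, w, _ = cube.shape
    idx = msfa_band_index(h, w, b)
    return np.take_along_axis(cube, idx[..., None], axis=2)[..., 0]


def naive_vflip(raw):
    """Direct vertical flip: the basic pattern is flipped along with the image."""
    return raw[::-1, :]


def pattern_preserving_vflip(raw, b):
    """Vertical flip at the level of basic-pattern rows.

    The order of the b-row blocks is reversed, but the rows inside each
    block keep their position, so every pixel still holds the band that
    the MSFA assigns to its location.
    """
    h, w = raw.shape
    if h % b != 0:
        raise ValueError("image height must be a multiple of the pattern size")
    blocks = raw.reshape(h // b, b, w)
    return blocks[::-1].reshape(h, w)


# Toy check: a cube whose band k is constant and equal to k, so the raw
# value at each pixel reveals which band that pixel holds.
toy_cube = np.ones((10, 10, 1)) * np.arange(25)
toy_raw = mosaic(toy_cube, 5)
layout = msfa_band_index(10, 10, 5)
pattern_kept = np.array_equal(pattern_preserving_vflip(toy_raw, 5), layout)
pattern_broken = not np.array_equal(naive_vflip(toy_raw), layout)
```

In this toy setup, `pattern_kept` and `pattern_broken` compare the band layout of each flipped image with the sensor layout, which makes the difference between panels (a) and (b) of Figure 3 easy to inspect. A horizontal variant would apply the same block-wise reversal along the column axis.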