What's color got to do with it? Face recognition in grayscale

Aman Bhatta, Domingo Mery, Haiyu Wu, Joyce Annan, Micheal C. King, Kevin W. Bowyer

TL;DR

This work investigates whether color information is necessary for state-of-the-art face recognition using deep CNNs. Through extensive experiments on color and grayscale data, including RGB and HSV color spaces, across multiple backbones and datasets (e.g., MORPH, IJB-B, IJB-C), it shows that deeper models achieve nearly identical accuracy when trained on grayscale versus color, even when tested on color images. The study reveals that color cues contribute little to identity discrimination, with the first convolutional layer often effectively performing grayscale conversion, and that color-space changes (HSV vs RGB) do not yield consistent gains. It also demonstrates practical benefits of grayscale data, such as reduced storage and opportunities to augment training data for improved performance. These findings have implications for dataset curation, training efficiency, and the deployment of face recognition systems in real-world, varied lighting conditions.

Abstract

State-of-the-art deep CNN face matchers are typically created using extensive training sets of color face images. Our study reveals that such matchers attain virtually identical accuracy when trained on either grayscale or color versions of the training set, even when the evaluation is done using color test images. Furthermore, we demonstrate that shallower models, lacking the capacity to model complex representations, rely more heavily on low-level features such as those associated with color. As a result, they display diminished accuracy when trained with grayscale images. We then consider possible causes for deeper CNN face matchers "not seeing color". Popular web-scraped face datasets actually have 30 to 60% of their identities with one or more grayscale images. We analyze whether this grayscale element in the training set impacts the accuracy achieved, and conclude that it does not. We demonstrate that using only grayscale images for both training and testing achieves accuracy comparable to that achieved using only color images for deeper models. This holds true for both real and synthetic training datasets. HSV color space, which separates chroma and luma information, does not improve the network's learning about color any more than in the RGB color space. We then show that the skin region of an individual's images in a web-scraped training set exhibits significant variation in their mapping to color space. This suggests that color carries limited identity-specific information. We also show that when the first convolution layer is restricted to a single filter, models learn a grayscale conversion filter and pass a grayscale version of the input color image to the next layer. Finally, we demonstrate that leveraging the lower per-image storage for grayscale to increase the number of images in the training set can improve accuracy of the face recognition model.
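The two grayscale ideas in the abstract can be sketched in a few lines of NumPy: converting a color image to three-channel grayscale for training, and a single first-layer filter acting as a grayscale conversion. This is only an illustration under stated assumptions. The ITU-R BT.601 luma weights, the 112x112 image size, and the helper names `rgb_to_gray3` and `single_filter_1x1` are not taken from the paper, and the 1x1 filter is a simplification of a real first convolution layer.

```python
import numpy as np

# Assumption: ITU-R BT.601 luma weights. The paper's exact conversion is not stated here.
LUMA = np.array([0.299, 0.587, 0.114])

def rgb_to_gray3(img):
    """Convert an HxWx3 RGB array to three-channel grayscale (same value in every channel)."""
    gray = img.astype(np.float64) @ LUMA            # HxW weighted sum over the channel axis
    return np.repeat(gray[..., None], 3, axis=-1)   # replicate to HxWx3

def single_filter_1x1(img, weights):
    """Apply one 1x1 convolution filter: a per-pixel weighted sum of the input channels."""
    return np.tensordot(img.astype(np.float64), weights, axes=([2], [0]))

# Hypothetical random image standing in for an aligned face crop.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(112, 112, 3), dtype=np.uint8)

gray3 = rgb_to_gray3(img)
conv_out = single_filter_1x1(img, LUMA)

# A single filter with luma-like weights reproduces the grayscale image exactly.
same_as_gray = np.allclose(conv_out, gray3[..., 0])

# Uncompressed, one 8-bit channel needs one third of the bytes of three channels.
storage_ratio = img[..., 0].nbytes / img.nbytes
```

With uncompressed 8-bit data, `storage_ratio` is one third, which is the kind of per-image saving that could be spent on additional training images. Compressed formats such as JPEG would give a different ratio.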

Paper Structure (17 sections, 3 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: A model trained with RGB images exhibits similar performance when applied to grayscale images from diverse demographics. This suggests that using grayscale images does not disproportionately affect any specific demographic group. Each image pair in the plot has a similarity score for its original RGB version and for its grayscale version. For each demographic, throughout the range of similarity, the cloud of points follows the 45-degree line. If grayscale gave consistently lower similarity scores, the cloud would trend below the 45-degree line. Top row: ResNet backbone, ArcFace loss, Glint training set. Bottom row: COTS matcher. Both tested on the MORPH dataset.
  • Figure 2: Example image pairs from MORPH with grayscale similarity greater than RGB similarity. Note that this result is from a matcher trained on RGB images. Match scores ($S_c$) are reported for a model trained with ArcFace loss on Glint360K (InsightFace).
  • Figure 3: Visualization of the 64 convolution filter weight values of the first convolution block, in row-major order. Approximately one-third of the convolution blocks dedicated to RGB data reached nearly zero values, while the remaining blocks exhibited a strikingly similar pattern across the RGB planes. Only a few convolution blocks displayed significantly different values for the RGB planes. In contrast, for the HSV planes, the most active block appears to derive its information primarily from the V plane, indicating that the network learns to extract more valuable data from this plane than from the others. In three-channel grayscale training, all three planes contain the same information, which results in the same values across the planes for all filters. Backbone: ResNet-50. Loss: ArcFace.
  • Figure 4: The fraction of nearest neighbors of an image of an identity, when mapped to RGB space, appears to be nearly random.
  • Figure 5: The first and last columns display the activation maps alongside the learned filters for models trained with one, two, and four filters in the first layer. The sum of elements for each filter, as expressed in Equation \ref{kernelsum}, is displayed above each respective filter. The middle column illustrates the vector projection of these learned filters. (Projection vectors are better visualized in color.)
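
The per-filter sum mentioned in the Figure 5 caption can be sketched as follows. The equation itself is not reproduced on this page, so this assumes it is a plain sum over all elements of a filter. The weight layout `(out_channels, in_channels, kH, kW)`, the helper names, and the example weights are hypothetical, not values from the paper.

```python
import numpy as np

def filter_sums(weights):
    """Sum all elements of each filter; weights has shape (out, in, kH, kW)."""
    return weights.reshape(weights.shape[0], -1).sum(axis=1)

def per_channel_sums(weights):
    """Sum each filter separately over its spatial taps, giving shape (out, in)."""
    return weights.sum(axis=(2, 3))

# Hypothetical single-filter first layer with a 3x3 kernel whose centre tap
# carries luma-like weights for the R, G, and B input planes.
w = np.zeros((1, 3, 3, 3))
w[0, :, 1, 1] = [0.299, 0.587, 0.114]

total = filter_sums(w)            # one value per filter
by_channel = per_channel_sums(w)  # contribution of each input plane
```

A filter whose elements sum to roughly 1 leaves the brightness of a constant-intensity image unchanged away from the borders, which is one way to read such a filter as a grayscale conversion. The per-channel sums show how much each input plane contributes.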