SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks

Iuliia Kotseruba, John K. Tsotsos

TL;DR

This work investigates how camera capture conditions shape the generalization of vision models by examining capture bias in widely used datasets and introducing SNAP, a controlled benchmark with dense sampling of shutter speed, ISO sensitivity, and aperture under two lighting levels. By evaluating 23 classifiers, 16 detectors, and 13 vision-language models, plus a human baseline for VQA, the study shows that many models struggle to generalize even to well-exposed SNAP images and are highly sensitive to exposure and camera-parameter variations, highlighting biases inherited from training data. The findings reveal stronger degradation in object detection and a nuanced, prompting-dependent performance in VQA, with pretraining on large, diverse data helping but not erasing capture-related weaknesses. Overall, SNAP provides a rigorous, exposure-aware framework to quantify and address capture bias, with practical implications for dataset construction, model training, and real-world deployment of vision systems, and the authors release code and data at the provided repository.

Abstract

Generalization of deep-learning-based (DL) computer vision algorithms to various image perturbations is hard to establish and remains an active area of research. The majority of past analyses focused on the images already captured, whereas effects of the image formation pipeline and environment are less studied. In this paper, we address this issue by analyzing the impact of capture conditions, such as camera parameters and lighting, on DL model performance on three vision tasks: image classification, object detection, and visual question answering (VQA). To this end, we assess capture bias in common vision datasets and create a new benchmark, SNAP (for $\textbf{S}$hutter speed, ISO se$\textbf{N}$sitivity, and $\textbf{AP}$erture), consisting of images of objects taken under controlled lighting conditions and with densely sampled camera settings. We then evaluate a large number of DL vision models and show the effects of capture conditions on each selected vision task. Lastly, we conduct an experiment to establish a human baseline for the VQA task. Our results show that computer vision datasets are significantly biased, that the models trained on this data do not reach human accuracy even on the well-exposed images, and that these models are susceptible to both major exposure changes and minute variations of camera settings. Code and data can be found at https://github.com/ykotseruba/SNAP

Paper Structure

This paper contains 20 sections, 21 figures, 5 tables.

Figures (21)

  • Figure 1: Normalized distribution of camera settings grouped by EV index for each dataset. Plots are arranged chronologically from bottom to top. Our SNAP dataset (introduced in the SNAP dataset section) is shown for reference.
  • Figure 2: The data collection setup consists of a Canon EOS Rebel T7 camera on a tripod, a table covered with a non-reflective cloth backdrop, and LED lights on either side.
  • Figure 3: Box plots show range and mean top-1 accuracy values for all models evaluated on SNAP. Blue circles show parameter sensitivity (PS) and red crosses mark top-1 accuracy on ImageNet.
  • Figure 4: Image classification top-1 (%) across exposure levels in SNAP. Each dot represents mean accuracy across all images in the corresponding EV offset bin.
  • Figure 5: Box plots show range and mean oLRP values for all object detection models evaluated on SNAP. Blue circles show parameter sensitivity (PS) w.r.t. oLRP.
  • ...and 16 more figures
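
The captions above group images by "EV index" and "EV offset", but this summary does not reproduce the paper's exact definitions. As a rough illustration only, the sketch below assumes the standard photographic exposure value referenced to ISO 100, EV = log2(N^2 / t) - log2(ISO / 100), where N is the f-number and t is the shutter time in seconds. The function names `exposure_value` and `ev_offset`, and the sign convention for the offset, are hypothetical and are not taken from the SNAP code.

```python
import math


def exposure_value(f_number: float, shutter_s: float, iso: float = 100.0) -> float:
    """Standard ISO-100-referenced exposure value of a camera setting.

    Assumption: this textbook formula stands in for the paper's "EV index";
    the benchmark itself may bin or offset the values differently.
    """
    if f_number <= 0 or shutter_s <= 0 or iso <= 0:
        raise ValueError("f-number, shutter time, and ISO must be positive")
    return math.log2(f_number ** 2 / shutter_s) - math.log2(iso / 100.0)


def ev_offset(setting: tuple, reference: tuple) -> float:
    """Difference in stops between a setting and a reference setting.

    Each tuple is (f_number, shutter_s, iso). Hypothetical convention:
    a positive result means the setting gathers more light than the reference.
    """
    return exposure_value(*reference) - exposure_value(*setting)


if __name__ == "__main__":
    reference = (4.0, 1 / 16, 100.0)   # f/4, 1/16 s, ISO 100
    brighter = (4.0, 1 / 16, 200.0)    # same aperture and shutter, ISO doubled
    print(exposure_value(*reference))
    print(ev_offset(brighter, reference))
```

Under this convention, doubling ISO, doubling the shutter time, or opening the aperture by one stop each shift the offset by one, which is why different combinations of the three parameters can land in the same EV bin.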