Psych-Occlusion: Using Visual Psychophysics for Aerial Detection of Occluded Persons during Search and Rescue

Arturo Miguel Russell Bernal, Jane Cleland-Huang, Walter Scheirer

TL;DR

This paper tackles reliable aerial detection of occluded persons in emergency response (ER) by integrating human perceptual data into computer vision (CV). It introduces Psych-ER, a large-scale human behavioral dataset collected from NOMAD images via MTurk to quantify how well humans locate occluded targets at varying distances, and uses these measurements to derive a psychophysical loss for bounding-box regression. The loss applies a Gaussian penalty on the distance between the predicted and ground-truth box centers. The width of the Gaussian, $\sigma(d,v)$, is set from human performance at distance $d$ and visibility $v$ via $\sigma(d,v) = 100 - \mathrm{mAP}@0.00(d,v)$. The penalty is $\mathrm{human\_penalty}(d,v) = 1 - \exp\left(-\frac{(x_{pred}-x_{gt})^2+(y_{pred}-y_{gt})^2}{2\,\sigma(d,v)^2}\right)$, and the resulting loss is $\mathrm{human\_loss}(d,v) = A \cdot \mathrm{human\_penalty}(d,v) + B \cdot (1 - \mathrm{human\_penalty}(d,v)) \cdot \mathrm{default\_loss}$. Evaluated on NOMAD with RetinaNet-R101-FPN, the psychophysical loss improves performance at longer distances and under occlusion without compromising near-distance accuracy, while incurring minimal training overhead and no extra inference cost. The work makes two key contributions: the Psych-ER dataset of human search behavior for occluded aerial views, and a human-guided localization loss formulation, which is a first step toward human-informed localization in ER-specific CV. These results have practical implications for deploying more robust onboard CV systems on sUAS in time-critical rescue missions.
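The loss in the TL;DR can be written out as a minimal Python sketch. It follows the three formulas as stated there, but several details are assumptions made only for illustration: the function names, the default weights `a = b = 1.0` for $A$ and $B$, the use of a percentage scale for mAP@0.00, the units of the box centers, and the small `eps` floor that avoids division by zero when human mAP reaches 100. None of these is confirmed by the summary, and the authors' released code may differ.

```python
import math


def sigma_from_human_map(map_at_0: float, eps: float = 1e-6) -> float:
    """sigma(d, v) = 100 - mAP@0.00(d, v).

    Assumes mAP is given in percent (0..100). The eps floor is an
    assumption added here to keep sigma strictly positive.
    """
    return max(100.0 - map_at_0, eps)


def human_penalty(pred_center, gt_center, sigma: float) -> float:
    """1 - exp(-squared center distance / (2 * sigma^2))."""
    dx = pred_center[0] - gt_center[0]
    dy = pred_center[1] - gt_center[1]
    return 1.0 - math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def human_loss(pred_center, gt_center, sigma: float,
               default_loss: float, a: float = 1.0, b: float = 1.0) -> float:
    """A * penalty + B * (1 - penalty) * default_loss.

    default_loss stands for whatever regression loss the detector
    already computes for this box; a and b are hypothetical weights.
    """
    p = human_penalty(pred_center, gt_center, sigma)
    return a * p + b * (1.0 - p) * default_loss


if __name__ == "__main__":
    # Hypothetical human mAP@0.00 of 80% for one (distance, visibility) cell.
    sigma = sigma_from_human_map(80.0)
    print(human_loss((105.0, 98.0), (100.0, 100.0), sigma, default_loss=0.4))
```

Under these formulas, a perfectly centered prediction has zero penalty and the loss reduces to `b * default_loss`. As the center error grows, the penalty approaches 1 and the loss approaches `a`. A larger $\sigma(d,v)$, which corresponds to conditions where humans were less accurate, makes the penalty grow more slowly with center error.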

Abstract

The success of Emergency Response (ER) scenarios, such as search and rescue, is often dependent upon the prompt location of a lost or injured person. With the increasing use of small Unmanned Aerial Systems (sUAS) as "eyes in the sky" during ER scenarios, efficient detection of persons from aerial views plays a crucial role in achieving a successful mission outcome. Fatigue of human operators during prolonged ER missions, coupled with limited human resources, highlights the need for sUAS equipped with Computer Vision (CV) capabilities to aid in finding the person from aerial views. However, the performance of CV models onboard sUAS substantially degrades under the rigorous real-life conditions of a typical ER scenario, where person search is hampered by occlusion and low target resolution. To address these challenges, we extracted images from the NOMAD dataset and performed a crowdsourced experiment to collect behavioral measurements when humans were asked to "find the person in the picture". We exemplify the use of our behavioral dataset, Psych-ER, by using its human accuracy data to adapt the loss function of a detection model. We tested our loss adaptation on a RetinaNet model evaluated on NOMAD against increasing distance and occlusion, with our psychophysical loss adaptation showing improvements over the baseline at higher distances across different levels of occlusion, without degrading performance at closer distances. To the best of our knowledge, our work is the first human-guided approach to address the location task of a detection model, while addressing real-world challenges of aerial search and rescue. All datasets and code can be found at: https://github.com/ArtRuss/NOMAD.

Paper Structure

This paper contains 13 sections, 4 equations, 11 figures.

Figures (11)

  • Figure 1: Development of Psych-ER, our behavioral dataset for Emergency Response (ER) aerial search, and its derived psychophysical loss. Integration of sUAS into ER scenarios has aided first responders and rescued victims [news_drowning, news_hikingAfrica, news_lostWoods, news_earthquake, news_girl, news_fire, news_teamwork, news_oldadult, news_firehomes] (first column). Onboard Computer Vision (CV) is a key component for the full integration of sUAS into ER missions; to address the inherent challenges of CV for ER scenarios we had previously published NOMAD [russell2024nomad], an ER-dedicated dataset composed of 42,825 aerial images, filmed from five aerial distances and providing a label detailing the degree of occlusion of each person's bounding box (second column). With the goal of improving CV performance for ER, we extracted a comprehensive subset of NOMAD and built an MTurk experiment to collect human behavioral measurements when facing the task "Find the person in the picture" (third column). Our behavioral dataset Psych-ER contains (1) human search behavioral data for more than 5000 images collected through the participant's screen cursor, including location and area of attention, as well as duration at every attention location; (2) accuracy of their selection relative to NOMAD's ground truth bounding box; (3) response time for every image (fourth column). Finally, we used the human accuracy data to formulate a psychophysical loss, and evaluated our loss performance on the RetinaNet architecture.
  • Figure 2: Sample interactive picture shown during the instructions of the MTurk "Find the person in the picture" survey, explaining the use of the provided zoom-magnifying-glass.
  • Figure 3: Samples shown to workers prior to their survey experiment. The black circles represent the recorded location and area of the zoom-magnifying-glass. (a) Good work sample showing a thorough search. (b) Bad work sample, showing no search.
  • Figure 4: Display of the experiment setup, showing one image at a time, with reminders of key instructions.
  • Figure 5: Histograms of IoU values $> 0$ between the worker's area selection and the ground truth bounding box. The smallest IoU-range bin of each histogram contains the most samples (highlighted bins), showing that workers were focusing on location rather than tightness of their area selection. The x-axis represents the visibility levels and the y-axis represents the distance levels.
  • ...and 6 more figures