
Cost-Efficient Multi-Scale Fovea for Semantic-Based Visual Search Attention

João Luzio, Alexandre Bernardino, Plinio Moreno

Abstract

Semantics are one of the primary sources of top-down preattentive information. Modern deep object detectors excel at extracting such valuable semantic cues from complex visual scenes. However, the size of the visual input to be processed by these detectors can become a bottleneck, particularly in terms of time costs, affecting an artificial attention system's biological plausibility and real-time deployability. Inspired by classical exponential density roll-off topologies, we apply a new artificial foveation module to our novel attention prediction pipeline: the Semantic-based Bayesian Attention (SemBA) framework. We aim to reduce detection-related computational costs without compromising visual task accuracy, thereby making SemBA more biologically plausible. The proposed multi-scale pyramidal field-of-view retains maximum acuity at an innermost level, around a focal point, while gradually increasing distortion for outer levels to mimic peripheral uncertainty via downsampling. In this work, we evaluate the performance of our novel Multi-Scale Fovea, incorporated into SemBA, on target-present visual search. We also compare it against other artificial foveal systems and conduct ablation studies with different deep object detection models to assess the impact of the new topology in terms of computational costs. We experimentally demonstrate that including the new Multi-Scale Fovea module effectively reduces inherent processing costs while improving SemBA's scanpath prediction accuracy. Remarkably, we show that SemBA closely approximates human consistency while retaining the actual human fovea's proportions.
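As a rough illustration of the multi-scale pyramidal field-of-view described above, the sketch below (our own reconstruction, not the authors' implementation) builds concentric crops that double in size around a focal point and downsamples each level to the base dimension by strided subsampling, so outer levels lose pixel density exponentially. The function name and parameters are hypothetical; the defaults mirror the 3-layer, 256-pixel-base configuration mentioned later for Figure 2.

```python
import numpy as np

def multiscale_fovea(image, focus, base=256, n_levels=3):
    """Build a multi-resolution pyramid of concentric crops around a
    focal point, then downsample every level to the base size so that
    peripheral (outer) levels carry exponentially less detail.

    Illustrative sketch only: crops are clipped at image borders, so
    off-center foci yield smaller outer levels.
    """
    h, w = image.shape[:2]
    fy, fx = focus
    levels = []
    for k in range(n_levels):
        half = (base << k) // 2          # window side doubles per level
        y0, y1 = max(0, fy - half), min(h, fy + half)
        x0, x1 = max(0, fx - half), min(w, fx + half)
        crop = image[y0:y1, x0:x1]
        stride = 1 << k                  # 1x, 2x, 4x, ... subsampling
        levels.append(crop[::stride, ::stride])
    return levels
```

Each returned level then has (at most) the base resolution, so a detector runs on constant-size inputs regardless of how much of the scene the outer levels cover.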

Paper Structure

This paper contains 14 sections, 3 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Illustration of our novel Multi-Scale Fovea mechanism. The proposed method consists of building a multi-resolution pyramid [image_pyramid] around a selected focal point and then downsampling all levels to the size of the innermost layer, to mimic the eccentricity effect [eccentricity]. Object detections from outer levels tend to reflect the uncertainty that derives from such exponential pixel density reduction [od_pyramid]. This technique facilitates the sequential extraction of semantic information [icdl] in a more cost-efficient and biologically plausible manner.
  • Figure 2: Schematic representation of the Semantic-based Bayesian Attention (SemBA) framework [neurocomp] for attention prediction, applied to visual search. In this example, we apply the proposed Multi-Scale Fovea mechanism, which generates $N=3$ layers around a focal point, starting at a base dimension of $256\times256$ pixels and doubling in size between consecutive layers. All layers are then downsampled to the base dimension to emulate peripheral distortion. Note that the artificial foveation block can be replaced by any other mechanism, such as FOVEA [fovea] or Laplacian Foveation [fovsys]. The same holds for the deep object detection block: although we apply YOLOv11 [yolov11], any other modern detection model can be integrated, e.g. DETR [detr] or RT-DETR [rtdetr]. Object detections are then filtered and fused on multiple grids of $20\times32$ cells to build semantic belief maps for all $K$ known classes. SemBA's active perception system then selects the belief map for a specific targeted class (cup, in this example) and picks the next-best fixation point based on its current belief state. This process is repeated until a termination criterion is met, while applying inhibition of return (IOR) [cocosearch18] to prevent revisiting fixated locations.
  • Figure 3: Comparison of the field-of-view topologies for each foveal system used in this work: full-resolution image (baseline), Laplacian Foveation [fovsys], FOVEA (Magnification) [fovea], and our Multi-Scale Fovea (in order, from left to right). Here, Multi-Scale Fovea's parameters are set such that the maximum-acuity region, corresponding to the fovea [fov], comprises a region equivalent to around 2.0° of visual angle (a $64 \times 64$ pixel patch in a $1050 \times 1680$ image). Note that SemBA processes each Multi-Scale Fovea layer separately; here, to showcase the topology, we overlap all layers while preserving their resolution.
  • Figure 4: Cumulative performances of humans (average), a random selection algorithm, and SemBA under different (a) object detectors, (b) foveal systems, and (c) Multi-Scale Fovea configurations, on target-present visual search. For experiment (a), we apply our Multi-Scale Fovea in a $4 \times 160$ configuration. We use the same configuration in experiment (b) when assessing the performance of the Multi-Scale Fovea. For experiments (b) and (c), we use SemBA$\times$DETR.
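The active-perception loop described for Figure 2 (repeatedly fixating the most promising cell of the target's belief map while inhibiting return to visited cells) can be sketched as follows. This is our illustrative reconstruction, not the authors' code: `run_search`, `detect_update`, and the termination threshold are all hypothetical names standing in for SemBA's foveate-detect-fuse cycle.

```python
import numpy as np

def run_search(belief, detect_update, max_fixations=10, found_thresh=0.9):
    """Greedy fixation loop over a belief grid (e.g. 20x32 cells):
    pick the cell with the highest belief, apply a detector-driven
    update (hypothetical stand-in for foveation + detection + fusion),
    and mark visited cells with an inhibition-of-return (IOR) mask."""
    ior = np.zeros_like(belief, dtype=bool)    # IOR mask: visited cells
    scanpath = []
    for _ in range(max_fixations):
        masked = np.where(ior, -np.inf, belief)
        fix = np.unravel_index(np.argmax(masked), belief.shape)
        scanpath.append(fix)
        if belief[fix] >= found_thresh:        # termination criterion
            break
        ior[fix] = True                        # never revisit this cell
        belief = detect_update(belief, fix)    # foveate, detect, fuse
    return scanpath
```

The IOR mask plays the same role as in the paper's pipeline: it forces the scanpath toward unexplored regions whenever the current fixation fails to confirm the target.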