Addressing a fundamental limitation in deep vision models: lack of spatial attention

Ali Borji

TL;DR

This paper highlights a significant limitation of current deep learning models, particularly vision models, and proposes two solutions that could pave the way for the next generation of more efficient vision models.

Abstract

The primary aim of this manuscript is to underscore a significant limitation in current deep learning models, particularly vision models. Unlike human vision, which efficiently selects only the essential visual areas for further processing, leading to high speed and low energy consumption, deep vision models process the entire image. In this work, we examine this issue from a broader perspective and propose two solutions that could pave the way for the next generation of more efficient vision models. In the first solution, convolution and pooling operations are selectively applied to altered regions, with a change map sent to subsequent layers. This map indicates which computations need to be repeated. In the second solution, only the modified regions are processed by a semantic segmentation model, and the resulting segments are inserted into the corresponding areas of the previous output map. The code is available at https://github.com/aliborji/spatial_attention.
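
The second solution described above can be illustrated with a minimal NumPy sketch. It is not taken from the linked repository: the function names, the `margin` parameter, and the bounding-box strategy are assumptions made for illustration, and `segment_fn` is a hypothetical stand-in for any semantic segmentation model that maps an image crop to a label map of the same height and width.

```python
import numpy as np

def changed_bbox(prev, curr, tol=0.0, margin=0):
    """Bounding box (r0, r1, c0, c1) of changed pixels, or None if nothing changed."""
    diff = np.abs(curr.astype(np.float64) - prev.astype(np.float64)) > tol
    if diff.ndim == 3:
        # Collapse the channel axis: a pixel changed if any channel changed.
        diff = diff.any(axis=-1)
    rows, cols = np.nonzero(diff)
    if rows.size == 0:
        return None
    height, width = diff.shape
    # Optionally grow the box by `margin` pixels to give the model more context.
    r0 = max(rows.min() - margin, 0)
    r1 = min(rows.max() + 1 + margin, height)
    c0 = max(cols.min() - margin, 0)
    c1 = min(cols.max() + 1 + margin, width)
    return int(r0), int(r1), int(c0), int(c1)

def update_segmentation(prev, curr, prev_seg, segment_fn, margin=0):
    """Segment only the changed region and paste it into the previous output map."""
    box = changed_bbox(prev, curr, margin=margin)
    if box is None:
        # Nothing changed, so the previous output is reused as-is.
        return prev_seg
    r0, r1, c0, c1 = box
    new_seg = prev_seg.copy()
    # segment_fn is a hypothetical placeholder for a real segmentation model.
    new_seg[r0:r1, c0:c1] = segment_fn(curr[r0:r1, c0:c1])
    return new_seg
```

A single bounding box is the simplest choice here; when changes are scattered across the frame, the box can grow to cover most of the image, and one box per connected changed region would likely save more computation.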

Paper Structure (7 sections, 11 figures, 1 table)

Motivation
Related work
Potential solution I
Experiments and results
Potential solution II
Conclusion
Appendix

Figures (11)

Figure 1: In real-world visual content, not much changes; the majority of the scene is often static. For example, in this image, a car has entered the scene and perhaps some tree branches have moved slightly due to the wind, but the rest of the scene remains largely unchanged. left: t - 1, right: t.
Figure 2: Illustration of the basic idea: First, a change map is computed from two consecutive frames ($I_{t-1}$ and $I_{t}$). This change map is sent to the first convolution layer, which updates the values in its previous output only for the changed regions. Knowing which regions have been updated, this layer generates its own change map and sends it to the next layer, and so on. Each layer maintains its own memory to avoid redundant calculations.
Figure 3: Top: architecture of CNN1 and CNN2. Bottom: architecture of CNN3.
Figure 4: Sequences of Images Presented to the CNNs: Top row (Exp I): The digits remain unchanged. Bottom row (Exp II): A digit is progressively shifted 1 pixel to the right.
Figure 5: Top row: The change map (64 x 64) showing the effect of shifting the image 1 pixel to the right. The subsequent rows, in order, show the change maps at each layer: 1st conv layer (64 x 64), 1st pooling layer (32 x 32), 2nd conv layer (32 x 32), and 2nd pooling layer (32 x 32). Each change map represents the output of a layer and is passed to the next layer to indicate what needs to be reused and what requires recomputation.
...and 6 more figures
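
The change-map propagation described in the captions of Figures 2 and 5 can be sketched roughly as follows. This is an illustrative NumPy sketch, not the author's implementation: all function names are assumptions, it handles a single channel with an odd-sized kernel and zero padding, and it conservatively treats every recomputed output position as "changed" when building the next layer's change map.

```python
import numpy as np

def change_map(prev, curr, tol=0.0):
    """Boolean map of pixels whose value differs between two frames."""
    return np.abs(curr.astype(np.float64) - prev.astype(np.float64)) > tol

def conv2d_same(img, kernel):
    """Dense 'same' cross-correlation with zero padding (full recomputation)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def dilate(mask, kh, kw):
    """Mark each output position whose receptive field touches a changed input."""
    ph, pw = kh // 2, kw // 2
    padded = np.pad(mask, ((ph, ph), (pw, pw)))
    out = np.zeros_like(mask)
    for di in range(kh):
        for dj in range(kw):
            out |= padded[di:di + mask.shape[0], dj:dj + mask.shape[1]]
    return out

def selective_conv(curr, kernel, prev_out, in_change):
    """Recompute only outputs affected by changed inputs; reuse the rest from memory."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    out_change = dilate(in_change, kh, kw)
    padded = np.pad(curr, ((ph, ph), (pw, pw)))
    out = prev_out.copy()  # the layer's memory of its previous output
    for i, j in zip(*np.nonzero(out_change)):
        out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out, out_change

def pool_change(mask, k=2):
    """Change map after k x k pooling: a cell changed if any input in its window did."""
    h, w = mask.shape
    trimmed = mask[:h - h % k, :w - w % k]
    return trimmed.reshape(h // k, k, w // k, k).any(axis=(1, 3))
```

In this sketch, a layer would call `selective_conv` with its stored previous output and the incoming change map, then pass the returned change map (through `pool_change` after a pooling layer) to the next layer, so that a 64 x 64 map becomes 32 x 32 after 2 x 2 pooling.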
