AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale

Adam Pardyl, Michał Wronka, Maciej Wołczyk, Kamil Adamczewski, Tomasz Trzciński, Bartosz Zieliński

TL;DR

AdaGlimpse uses Soft Actor-Critic, a reinforcement learning algorithm tailored for exploration tasks, to select glimpses of arbitrary position and scale, which enables the model to rapidly establish a general awareness of the environment before zooming in for detailed analysis.

Abstract

Active Visual Exploration (AVE) is a task that involves dynamically selecting observations (glimpses), which is critical to facilitate comprehension and navigation within an environment. While modern AVE methods have demonstrated impressive performance, they are constrained to fixed-scale glimpses from rigid grids. In contrast, existing mobile platforms equipped with optical zoom capabilities can capture glimpses of arbitrary positions and scales. To address this gap between software and hardware capabilities, we introduce AdaGlimpse. It uses Soft Actor-Critic, a reinforcement learning algorithm tailored for exploration tasks, to select glimpses of arbitrary position and scale. This approach enables our model to rapidly establish a general awareness of the environment before zooming in for detailed analysis. Experimental results demonstrate that AdaGlimpse surpasses previous methods across various visual tasks while maintaining greater applicability in realistic AVE scenarios.

Paper Structure (34 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Adaptive Glimpse (AdaGlimpse): Our approach selects and processes glimpses of arbitrary position and scale, fully exploiting the capabilities of modern hardware. In this example, AdaGlimpse selects a low-resolution glimpse of the whole environment. Based on this glimpse, it predicts a bird with probability $0.01$, too low to make the final decision. Instead, it selects the second glimpse by zooming in to the upper left corner. The process repeats four times until the probability of the predicted class is higher than a specified threshold.
  • Figure 2: Architecture: AdaGlimpse consists of two parts: a vision transformer-based encoder with a task-specific head (see \ref{sec:elasticvit}) and a Soft Actor-Critic RL agent (see \ref{sec:rl}). At each exploration step, the RL agent selects the position and scale of the next glimpse based on information about the previous patches: their coordinates, importance, and latent representations.
  • Figure 3: RL agent: The RL module of AdaGlimpse uses two networks: the actor and the critic. The actor predicts the action $a_t$ (position and scale of the next glimpse) based on the state $s_t = (\widehat{G}_t, \widehat{C}_t, \widehat{I}_t, \widehat{H}_t)$. The critic estimates $Q(s_t, a_t)$, the expected cumulative reward for taking this action.
  • Figure 4: Glimpse selection step-by-step: AdaGlimpse explores $224 \times 224$ images from ImageNet with $32 \times 32$ glimpses of variable scale, zooming in on objects of interest and stopping the process after reaching $75\%$ predicted probability. The rows correspond to: A) glimpse locations, B) pixels visible to the model (interpolated from glimpses for preview), C) predicted label, D) prediction probability.
  • Figure 5: Reconstruction quality for SUN360 (top) and ADE20K (bottom): Sample reconstructions of our method compared with AME \cite{pardyl2023active}, AttSeg \cite{seifi2020attend}, GlAtEx \cite{seifi2021glimpse} and SimGlim \cite{jha2023simglim} on the SUN360 and ADE20K datasets. Reconstructions produced by our method are visibly more detailed and less blurry than those obtained by baseline methods. Note that the images for comparison were taken from the baseline publications (we did not select them).
  • ...and 3 more figures
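
The exploration loop described in the captions of Figures 1 and 4 (pick a glimpse of some position and scale, update the prediction, stop once the predicted probability reaches a threshold) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the `(x, y, scale)` action encoding, the nearest-neighbour resampling, the `max_glimpses` cap, and the `toy_policy` and `toy_classifier` stand-ins are all assumptions made here. In AdaGlimpse the policy is a Soft Actor-Critic agent and the predictor is a vision transformer-based encoder with a task-specific head.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[float]]            # row-major grayscale image (assumed layout)
Action = Tuple[float, float, float]  # (x, y, scale) in [0, 1] (assumed encoding)


def extract_glimpse(image: Image, x: float, y: float, scale: float,
                    out_size: int = 32) -> Image:
    """Crop a square window and resample it to out_size x out_size.

    `scale` is the window side as a fraction of the shorter image side;
    (x, y) places the window's top-left corner within the remaining space.
    Nearest-neighbour sampling keeps the sketch dependency-free.
    """
    height, width = len(image), len(image[0])
    side = max(1, round(scale * min(height, width)))
    top = round(y * (height - side))
    left = round(x * (width - side))
    return [
        [image[top + i * side // out_size][left + j * side // out_size]
         for j in range(out_size)]
        for i in range(out_size)
    ]


def explore(image: Image,
            policy: Callable[[List[Image], List[Action]], Action],
            classifier: Callable[[List[Image], List[Action]], Tuple[str, float]],
            threshold: float = 0.75,
            max_glimpses: int = 12) -> Tuple[Optional[str], float, List[Action]]:
    """Collect glimpses until the predicted probability reaches `threshold`."""
    glimpses: List[Image] = []
    actions: List[Action] = []
    label, prob = None, 0.0
    for _ in range(max_glimpses):
        action = policy(glimpses, actions)      # choose position and scale
        glimpses.append(extract_glimpse(image, *action))
        actions.append(action)
        label, prob = classifier(glimpses, actions)
        if prob >= threshold:                   # confident enough: stop early
            break
    return label, prob, actions


def toy_policy(glimpses: List[Image], actions: List[Action]) -> Action:
    """Hypothetical stand-in: whole image first, then zoom into the top-left."""
    if not actions:
        return (0.0, 0.0, 1.0)
    return (0.0, 0.0, actions[-1][2] / 2)


def toy_classifier(glimpses: List[Image],
                   actions: List[Action]) -> Tuple[str, float]:
    """Hypothetical stand-in: confidence simply grows with each glimpse."""
    return "bird", 1.0 - 0.5 ** len(glimpses)


# A synthetic 224 x 224 image, explored with 32 x 32 glimpses.
image = [[float((r + c) % 7) for c in range(224)] for r in range(224)]
label, prob, actions = explore(image, toy_policy, toy_classifier)
```

Every glimpse has the same pixel size whatever area it covers, so a full-image glimpse is a coarse overview and a small-scale glimpse is a detailed close-up. That is the trade-off the captions describe when the model zooms in after the first low-resolution view.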