EVA: Bridging Performance and Human Alignment in Hard-Attention Vision Models for Image Classification

Pengcheng Pan, Yonekura Shogo, Kuniyoshi Yasuo

Abstract

Optimizing vision models purely for classification accuracy can impose an alignment tax, degrading human-like scanpaths and limiting interpretability. We introduce EVA, a neuroscience-inspired hard-attention mechanistic testbed that makes the performance-human-likeness trade-off explicit and adjustable. EVA samples a small number of sequential glimpses using a minimal fovea-periphery representation with a CNN-based feature extractor, and integrates variance control and adaptive gating to stabilize and regulate attention dynamics. EVA is trained with the standard classification objective, without gaze supervision. On CIFAR-10 with dense human gaze annotations, EVA improves scanpath alignment under established metrics such as DTW and NSS while maintaining competitive accuracy. Ablations show that CNN-based feature extraction drives accuracy but suppresses human-likeness, whereas variance control and gating restore human-aligned trajectories with minimal performance loss. We further validate EVA's scalability on ImageNet-100 and evaluate scanpath alignment on COCO-Search18, where EVA yields human-like scanpaths on natural scenes without gaze supervision or additional finetuning. Overall, EVA provides a principled framework for trustworthy, human-interpretable active vision.
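For concreteness, the fovea-periphery glimpse described above can be illustrated with a minimal sketch. This is hypothetical code, not the authors' implementation: the function name `extract_glimpse`, the glimpse size, and the normalized coordinate convention are assumptions. It concatenates a high-resolution foveal crop at the current fixation with a downsampled peripheral view of the whole image, which would then feed the CNN feature extractor.

```python
import torch
import torch.nn.functional as F

def extract_glimpse(image: torch.Tensor, fixation: torch.Tensor,
                    glimpse_size: int = 8) -> torch.Tensor:
    """Foveal crop at `fixation` plus a downsampled peripheral view of the whole image.

    image:    (B, C, H, W) batch of images.
    fixation: (B, 2) fixation centres in normalized (x, y) coordinates, range [-1, 1].
    """
    B, C, H, W = image.shape
    # Fovea: a glimpse_size x glimpse_size patch at full resolution, centred on the fixation.
    lin_y = torch.linspace(-glimpse_size / H, glimpse_size / H, glimpse_size, device=image.device)
    lin_x = torch.linspace(-glimpse_size / W, glimpse_size / W, glimpse_size, device=image.device)
    gy, gx = torch.meshgrid(lin_y, lin_x, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0) + fixation.view(B, 1, 1, 2)
    fovea = F.grid_sample(image, grid, align_corners=True)           # (B, C, g, g)
    # Periphery: the whole image downsampled to the same spatial size (coarse context).
    periphery = F.interpolate(image, size=(glimpse_size, glimpse_size),
                              mode="bilinear", align_corners=False)  # (B, C, g, g)
    return torch.cat([fovea, periphery], dim=1)                      # (B, 2C, g, g)
```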

Paper Structure

This paper contains 41 sections, 19 equations, 13 figures, 3 tables, and 1 algorithm.

Figures (13)

  • Figure 1: EVA reduces the alignment tax through mechanism-aware design. Left: a schematic trade-off between accuracy and scanpath similarity to humans. EVA improves the trade-off relative to full-image baselines and prior hard-attention models. Right: EVA uses minimal fovea-periphery sensing with variance control and adaptive gating to regulate active sampling dynamics.
  • Figure 2: EVA model architecture. Top: one glimpse step at time $t$. EVA combines a minimal fovea-periphery retina with a CNN feature extractor and a two-level recurrent backbone. The lower recurrent state updates the glimpse representation and controls the next fixation. An adaptive gate regulates information flow to the upper recurrent classifier. Bottom: a prediction-error signal modulates the fixation variance and the gate, regulating evidence acquisition dynamics. The components are motivated by functional motifs rather than biological fidelity. (A minimal code sketch of this glimpse step follows the figure list.)
  • Figure 3: Qualitative comparison of scanpaths on CIFAR-10. Columns show different models and rows show example images. Orange denotes the model scanpath and blue denotes the human scanpath.
  • Figure 4: Accuracy--alignment trade-off on CIFAR-10. The x-axis is center-debiased GCS and the y-axis is classification accuracy.
  • Figure 5: ImageNet-100 scalability. Top: ImageNet-100 accuracy and FLOPs. Pre. indicates initialization from ImageNet-1K pretrained weights. Bottom: an example scanpath from EVA where the red rectangle indicates the foveal crop.
  • ...and 8 more figures
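To make the Figure 2 description concrete, the following is a minimal, hypothetical sketch of one glimpse step: lower and upper recurrent states, an adaptive gate, and a prediction-error signal that modulates both the gate and the fixation variance. The class name `GlimpseStep`, the module sizes, and the entropy-based error surrogate are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GlimpseStep(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 256, n_classes: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(               # CNN feature extractor on the retina input
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),   # 6 channels: foveal + peripheral RGB
            nn.AdaptiveAvgPool2d(2), nn.Flatten(),
            nn.Linear(32 * 4, feat_dim), nn.ReLU())
        self.lower = nn.GRUCell(feat_dim, hidden)    # lower recurrence: glimpse state, "where" control
        self.upper = nn.GRUCell(hidden, hidden)      # upper recurrence: class-evidence accumulator
        self.gate = nn.Linear(hidden + 1, hidden)    # adaptive gate, modulated by the error signal
        self.where = nn.Linear(hidden, 2)            # mean of the next fixation (normalized coords)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, glimpse, h_low, h_up, prev_logits, base_std: float = 0.1):
        feat = self.encoder(glimpse)
        h_low = self.lower(feat, h_low)
        # Prediction-error surrogate: entropy of the previous class prediction.
        probs = prev_logits.softmax(-1)
        pred_err = -(probs * probs.clamp_min(1e-8).log()).sum(-1, keepdim=True)
        # Adaptive gate regulates how much glimpse evidence reaches the classifier state.
        g = torch.sigmoid(self.gate(torch.cat([h_low, pred_err], dim=-1)))
        h_up = self.upper(g * h_low, h_up)
        logits = self.classifier(h_up)
        # Variance control: higher uncertainty widens exploration of the next fixation.
        mu = torch.tanh(self.where(h_low))
        std = base_std * (1.0 + pred_err)
        next_fix = (mu + std * torch.randn_like(mu)).clamp(-1, 1)
        return logits, next_fix, h_low, h_up
```

In this sketch the same error signal drives both the gate and the fixation noise, mirroring the bottom panel of Figure 2; the specific modulation functions used by EVA are whatever the paper defines, and this block only shows the general control flow.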