HAViT: Historical Attention Vision Transformer

Swarnendu Banik, Manish Das, Shiv Ram Dubey, Satish Kumar Singh

Abstract

Vision Transformers have excelled in computer vision, but their attention mechanisms operate independently across layers, which limits information flow and feature learning. We propose a cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach progressively refines attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT accuracy increasing from 75.74% to 77.07% on CIFAR-100 (+1.33 percentage points) and from 57.82% to 59.07% on TinyImageNet (+1.25 percentage points). Cross-architecture validation shows similar gains across transformer variants, with CaiT improving by 1.01%. Systematic analysis identifies a historical-attention blending hyperparameter of alpha = 0.45 as optimal across all configurations, giving the best balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
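
The storage-and-blending idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the blending equation, so the convex combination `(1 - alpha) * current + alpha * history`, the single-head setting, the softmax-normalized random initialization, and the names `blend_attention` and `propagate` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def blend_attention(current, history, alpha=0.45):
    """ASSUMED blending rule (not stated in the abstract):
    a convex combination of current and historical attention."""
    return (1.0 - alpha) * current + alpha * history


def propagate(scores_per_layer, alpha=0.45, init="random", seed=0):
    """Carry a historical attention matrix through a stack of layers.

    scores_per_layer: list of (n, n) pre-softmax attention scores,
    one per encoder layer (single head, single sample for brevity).
    """
    rng = np.random.default_rng(seed)
    n = scores_per_layer[0].shape[-1]
    if init == "random":
        # Assumption: random history is row-normalized like attention.
        history = softmax(rng.standard_normal((n, n)))
    elif init == "zero":
        history = np.zeros((n, n))
    else:
        raise ValueError(f"unknown init: {init!r}")

    blended = []
    for scores in scores_per_layer:
        current = softmax(scores)                 # this layer's attention
        history = blend_attention(current, history, alpha)
        blended.append(history)                   # stored for the next layer
    return blended


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    scores = [rng.standard_normal((4, 4)) for _ in range(3)]
    for i, attn in enumerate(propagate(scores), start=1):
        print(f"layer {i}: row sums = {attn.sum(axis=-1)}")
```

Under these assumptions, a row-normalized random history keeps every blended matrix row-stochastic, whereas a zero history leaves the first layer's rows summing to `1 - alpha`. That is a property of this sketch only and is not offered as an explanation of the initialization results reported in the paper.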

Paper Structure (21 sections, 8 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of the modified ViT with historical attention propagation.
  • Figure 2: (a) Test accuracy comparison over epochs for CIFAR-100 dataset. (b) Training loss comparison over epochs for CIFAR-100 dataset.
  • Figure 3: (a) Test accuracy comparison over epochs for TinyImageNet dataset. (b) Training loss comparison over epochs for TinyImageNet dataset.
  • Figure 4: Initialization Strategy Impact Analysis: (a) Accuracy for $\alpha$ on CIFAR-100 (ViT baseline 75.74%), (b) Accuracy for $\alpha$ on TinyImageNet (ViT baseline 57.82%).
  • Figure 5: Attention map comparison between Base ViT and HAViT across CIFAR-100 classes.