GFT: Gradient Focal Transformer

Boris Kriuk, Simranjit Kaur Gill, Shoaib Aslam, Amir Fakhrutdinov

TL;DR

The paper tackles fine-grained image classification (FGIC), where traditional CNNs and ViT-based approaches struggle to balance global context with discriminative local details. It introduces the Gradient Focal Transformer (GFT), which combines Gradient Attention Learning Alignment (GALA), a gradient-guided measure of feature importance, with Progressive Patch Selection (PPS), which progressively prunes irrelevant patches. GFT reports state-of-the-art performance on FGVC Aircraft, Food-101, and COCO with 93M parameters, along with improved efficiency and interpretable gradient-based attention maps. The work contributes a practical, scalable FGIC framework that bridges global context and local detail extraction, with potential for real-world deployment and with multi-modal extensions and edge-device optimization as future directions.

Abstract

Fine-Grained Image Classification (FGIC) remains a complex task in computer vision, as it requires models to distinguish between categories with subtle, localized visual differences. Well-studied CNN-based models, while strong in local feature extraction, often fail to capture the global context required for fine-grained recognition. More recent ViT-based models address FGIC with attention-driven mechanisms but lack the ability to adaptively focus on truly discriminative regions. TransFG and other ViT-based extensions introduced part-aware token selection to enhance attention localization, yet they still struggle with computational efficiency, flexibility in attention region selection, and maintaining focus on fine detail in complex environments. This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for FGIC tasks. GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features by analyzing attention gradient flow. Coupled with a Progressive Patch Selection (PPS) strategy, the model progressively filters out less informative regions, reducing computational overhead while enhancing sensitivity to fine details. GFT achieves SOTA accuracy on the FGVC Aircraft, Food-101, and COCO datasets with 93M parameters, outperforming advanced ViT-based FGIC models in efficiency. By bridging global context and localized detail extraction, GFT sets a new benchmark in fine-grained recognition, offering interpretable solutions for real-world deployment scenarios.
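
The abstract does not spell out how PPS scores or drops patches. As a rough illustration of the general idea of progressively filtering out less informative regions, the following is a minimal sketch, not the authors' implementation. It assumes that a per-patch importance score is available at each stage and that a fixed fraction of the surviving patches is kept per stage; the names `select_top_patches`, `progressive_patch_selection`, `stage_scores`, and `keep_ratio` are hypothetical.

```python
from typing import List, Sequence


def select_top_patches(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return positions of the highest-scoring entries, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Rank positions by score, highest first (ties keep their original order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def progressive_patch_selection(
    stage_scores: Sequence[Sequence[float]], keep_ratio: float
) -> List[List[int]]:
    """Prune patches stage by stage and return the surviving indices per stage.

    stage_scores[s][p] is an assumed importance score for original patch p at
    stage s. How GFT actually derives such scores is not given in the abstract.
    """
    kept = list(range(len(stage_scores[0])))
    history = [kept]
    for scores in stage_scores:
        # Only patches that survived earlier stages compete at this stage.
        local_scores = [scores[p] for p in kept]
        chosen = select_top_patches(local_scores, keep_ratio)
        kept = [kept[i] for i in chosen]
        history.append(kept)
    return history


if __name__ == "__main__":
    # Toy example: 8 patches, 2 stages, half of the survivors kept per stage.
    toy_scores = [
        [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6],
        [0.0, 0.2, 0.0, 0.9, 0.0, 0.4, 0.0, 0.1],
    ]
    for stage, indices in enumerate(progressive_patch_selection(toy_scores, 0.5)):
        print(f"after stage {stage}: {indices}")
```

In this toy run the patch set shrinks from 8 to 4 to 2 patches, so later stages process fewer tokens. In a gradient-guided variant, the per-stage scores would presumably come from attention gradients rather than from fixed numbers, but that link is an assumption here.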

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: GFT Architecture Overview.
  • Figure 2: Absolute Attention vs GALA.
  • Figure 3: Progressive Patch Selection in GFT.
  • Figure 4: GFT Importance Regions in FGVC Aircraft Dataset.
  • Figure 5: Gradient Flow across GFT Layers.