Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition

Shun Zou, Yi Zou, Mingya Zhang, Shipeng Luo, Zhihao Chen, Guangwei Gao

TL;DR

Fraesormer tackles the inefficiency of Transformer-based food recognition by introducing adaptive sparse attention and multi-scale feature gating. It pairs Adaptive Top-k Sparse Partial Attention (ATK-SPA), which is built around a learnable Gated Dynamic Top-K Operator (GDTKO), with a Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN) to reduce redundancy while capturing cross-scale context. Extensive experiments on four benchmark datasets show improved accuracy with lower parameter counts and MACs compared to state-of-the-art CNN-, ViT-, and hybrid-based models, highlighting strong edge-efficiency. The work advances practical food recognition by enabling robust, scalable inference on unstructured dishes with limited data and resources.

Abstract

In recent years, Transformers have driven significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) quadratic complexity and redundant feature representations caused by interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering out low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs a gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. Code is available at https://zs1314.github.io/Fraesormer.
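
The idea of keeping only the strongest query-key matches can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a fixed `top_k` for a single head, whereas GDTKO is described as learnable and dynamic, and it omits the partial channel mechanism entirely. The function name `topk_sparse_attention` and all shapes are hypothetical, chosen only for illustration.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Single-head attention that keeps only the top_k scores per query.

    Illustrative only: top_k is a fixed integer here (an assumption),
    not the learnable gated operator described in the abstract.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (n_queries, n_keys)
    # Per-row threshold: the top_k-th largest score of each query.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    # Scores below the threshold are masked out before the softmax.
    masked = np.where(scores >= kth, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # masked entries become exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Example usage with random tokens (4 queries, 6 keys, dimension 8).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
out, weights = topk_sparse_attention(q, k, v, top_k=2)
```

Each query then aggregates values from only `top_k` keys, so low query-key matches contribute nothing to the output. Setting `top_k` equal to the number of keys recovers ordinary dense softmax attention.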

Paper Structure

This paper contains 10 sections, 13 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: A seven-dimensional radar chart of Top-1 accuracy on ETHZ-101 [food-101], Vireo-172 [food-172], UEC-256 [food-256], and SuShi-50 [qiu2019mining], along with average accuracy, Params, and GMACs.
  • Figure 2: Visualization of the impact of each spatial location on the final prediction of the DeiT-S model [Chefer_2021_CVPR, touvron2021training]. The results show that the final prediction of the vision transformer is primarily based on the most influential tokens, indicating that a large portion of tokens can be removed without affecting performance.
  • Figure 3: The overall architecture of the proposed Fraesormer.
  • Figure 4: Ablation analysis of different values of $k$.