Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition

Shun Zou; Yi Zou; Mingya Zhang; Shipeng Luo; Zhihao Chen; Guangwei Gao

TL;DR

Fraesormer tackles the inefficiency of Transformer-based food recognition with adaptive sparse attention and multi-scale feature gating. Its Adaptive Top-k Sparse Partial Attention (ATK-SPA) uses a learnable Gated Dynamic Top-K Operator (GDTKO) to keep only the most relevant attention scores, and its Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN) captures cross-scale context. Experiments on four benchmark datasets show higher accuracy with fewer parameters and MACs than state-of-the-art CNN-, ViT-, and hybrid-based models, which makes the model well suited to edge deployment. The work targets practical food recognition: robust, scalable inference on unstructured dishes with limited data and resources.

Abstract

In recent years, Transformers have made significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) quadratic complexity and redundant feature representation caused by interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose Fraesormer, an adaptive and efficient sparse Transformer architecture with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and a Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering out low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs a gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. Code is available at https://zs1314.github.io/Fraesormer.
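The idea of retaining only the strongest query-key matches can be illustrated with a minimal NumPy sketch of top-k sparse attention. This is not the authors' implementation: the function name `topk_sparse_attention` and the argument `keep` are hypothetical, the sketch is single-head, and the number of retained scores is a fixed argument here, whereas the abstract describes GDTKO as learning it. The partial channel mechanism and HSSFGN are not modeled.

```python
import numpy as np


def topk_sparse_attention(q, k, v, keep):
    """Single-head attention that keeps only the `keep` largest scores per query.

    Illustrative only: `keep` is a fixed integer here (an assumption), not a
    learned, gated quantity.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (n_q, n_k) scaled similarities
    keep = max(1, min(keep, scores.shape[-1]))     # clamp to a valid range

    # Per-row threshold: the keep-th largest score. Exact ties at the
    # threshold would let slightly more than `keep` entries through.
    thresh = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]

    # Low query-key matches are set to -inf so they get zero softmax weight.
    masked = np.where(scores >= thresh, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage: 6 tokens with 8 channels each, keeping 2 keys per query.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 8)) for _ in range(3))
out, weights = topk_sparse_attention(q, k, v, keep=2)
print(out.shape, (weights > 0).sum(axis=-1))
```

Setting `keep` to the number of keys recovers ordinary dense softmax attention, so the sparsity level is the only knob in this sketch. Note that it still forms the full score matrix before masking, so it illustrates the filtering step rather than any reduction in compute.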
