SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery

Chunming Li, Shidong Wang, Tong Xin, Haofeng Zhang

TL;DR

SIEFormer reinterprets Vision Transformer attention through spectral analysis and introduces two complementary branches: an implicit Band-Adaptive Filter (BaF) operating on token values via graph Laplacian spectral filtering, and an explicit Maneuverable Filtering Layer (MFL) that filters value features in the frequency domain using Fourier transforms. This dual-spectral design enables joint optimization for Generalized Category Discovery, achieving state-of-the-art results on diverse generic and fine-grained datasets while offering strong ablations that highlight the contribution of each spectral component. The approach demonstrates robust performance improvements, especially in discovering new categories, and shows favorable efficiency compared to strong ViT-based baselines. The work establishes a pathway for incorporating spectral theory into open-world recognition, with potential extensions to related tasks such as incremental learning and out-of-distribution detection.

Abstract

This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attention mechanism within Vision Transformer (ViT) and enhance feature adaptability, with particular emphasis on challenging Generalized Category Discovery (GCD) tasks. The proposed SIEFormer is composed of two main branches, each corresponding to an implicit and explicit spectral perspective of the ViT, enabling joint optimization. The implicit branch realizes the use of different types of graph Laplacians to model the local structure correlations of tokens, along with a novel Band-adaptive Filter (BaF) layer that can flexibly perform both band-pass and band-reject filtering. The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens by applying the Fourier transform to the input "value" features, modulating the transformed signal with a set of learnable parameters in the frequency domain, and then performing an inverse Fourier transform to obtain the enhanced features. Extensive experiments reveal state-of-the-art performance on multiple image recognition datasets, reaffirming the superiority of our approach through ablation studies and visualizations.
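
The explicit branch described above (Fourier transform, modulation by learnable parameters, inverse Fourier transform) can be illustrated with a minimal NumPy sketch. This is not the authors' MFL: the function name `frequency_filter`, the choice of a 1D real FFT along the token axis, and the shape of `weights` are assumptions made for illustration, and the weights are fixed arrays here rather than trained parameters.

```python
import numpy as np

def frequency_filter(values, weights):
    """Filter token features in the frequency domain (illustrative sketch).

    values:  real array of shape (num_tokens, dim), standing in for "value" features.
    weights: array of shape (num_tokens // 2 + 1, dim), standing in for the
             learnable frequency-domain parameters (hypothetical layout).
    """
    num_tokens = values.shape[0]
    # Forward transform along the token axis.
    spectrum = np.fft.rfft(values, axis=0)
    # Element-wise modulation of each frequency component.
    modulated = spectrum * weights
    # Inverse transform back to the token domain.
    return np.fft.irfft(modulated, n=num_tokens, axis=0)

rng = np.random.default_rng(0)
values = rng.standard_normal((8, 4))

# All-ones weights leave the features unchanged.
identity = np.ones((8 // 2 + 1, 4))
restored = frequency_filter(values, identity)

# Keeping only the zero-frequency component replaces every token
# with the per-channel mean over tokens (an extreme low-pass filter).
dc_only = np.zeros((8 // 2 + 1, 4))
dc_only[0] = 1.0
smoothed = frequency_filter(values, dc_only)
```

Because each frequency component mixes information from every token, a single element-wise multiplication in this domain acts globally across tokens, which is the sense in which such a layer can capture global dependencies.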

Paper Structure

This paper contains 31 sections, 1 theorem, 28 equations, 9 figures, 12 tables, 1 algorithm.

Key Result

Proposition 1

The symmetrically normalized adjacency matrix $\tilde{\mathbf{A}}$ is a reasonable simplification of $\mathbf{T}$ in Eq. (BaF).
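
To make the objects in Proposition 1 concrete, the following NumPy sketch builds a symmetrically normalized adjacency matrix $\tilde{\mathbf{A}} = \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ and applies band-pass and band-reject filters on the spectrum of the Laplacian $\mathbf{I} - \tilde{\mathbf{A}}$. The matrix $\mathbf{T}$ is not defined in this section, so the sketch does not model it. The Gaussian response shape, the `center` and `width` parameters, and the random symmetric adjacency are all assumptions for illustration, not the paper's BaF.

```python
import numpy as np

def normalized_adjacency(adjacency):
    """Return D^{-1/2} A D^{-1/2} for a symmetric, non-negative adjacency."""
    degree = adjacency.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    return adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]

def band_pass(lam, center=1.0, width=0.5):
    # Hypothetical response: keeps eigenvalues near `center`.
    return np.exp(-((lam - center) / width) ** 2)

def band_reject(lam, center=1.0, width=0.5):
    # Complement of the band-pass response.
    return 1.0 - band_pass(lam, center, width)

def spectral_filter(adjacency, values, response):
    """Apply U g(Lambda) U^T to `values`, where L = I - A_tilde = U Lambda U^T."""
    a_tilde = normalized_adjacency(adjacency)
    laplacian = np.eye(len(a_tilde)) - a_tilde
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    gains = response(eigvals)
    filtered = eigvecs @ (gains[:, None] * (eigvecs.T @ values))
    return filtered, eigvals

rng = np.random.default_rng(0)
scores = rng.random((6, 6))
adjacency = (scores + scores.T) / 2      # symmetric token-affinity stand-in
values = rng.standard_normal((6, 3))     # stand-in for token "value" features

passed, eigvals = spectral_filter(adjacency, values, band_pass)
rejected, _ = spectral_filter(adjacency, values, band_reject)
```

With this normalization the Laplacian eigenvalues lie in $[0, 2]$, so a filter response only needs to be defined on that interval. Since the two responses sum to one at every eigenvalue, the band-pass and band-reject outputs add back up to the original features.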

Figures (9)

  • Figure 1: Comparison between the standard Vision Transformer (ViT, left) and the proposed SIEFormer (right). SIEFormer introduces two spectral views: an implicit view, which applies graph Laplacian filtering (Band-adaptive Filter, BaF) to the value features generated by self-attention, and an explicit view, which applies the Fourier transform (Maneuverable Filtering Layer, MFL). The two views are optimized jointly, which helps generate discriminative features for discovering novel categories.
  • Figure 2: Overview of SIEFormer. The Implicit Spectral Branch uses the Band-adaptive Filter to operate on the values in self-attention through the eigenvalues of the Laplacian matrix. The Explicit Spectral Branch converts the values directly to the spectral domain using the fast Fourier transform and reconstructs them using the MFL.
  • Figure 3: Heatmaps using different filters. The heatmaps are taken from the value matrices of filter outputs trained on the CUB-200 dataset using the low-pass filter, the high-pass filter, and the proposed Band-adaptive Filter (BaF), respectively.
  • Figure 4: Performance (%) with different weights for supervised contrastive learning.
  • Figure 5: The t-SNE visualization of new categories on Stanford-Cars with only supervised contrastive learning.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof