START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation

Jintao Guo, Lei Qi, Yinghuan Shi, Yang Gao

TL;DR

The paper targets Domain Generalization by improving the generalization of Mamba-based State Space Models under domain shifts. It provides a theoretical analysis showing that input-dependent matrices can accumulate domain-specific information, and introduces START, a saliency-driven token-aware transformation with two variants (START-M and START-X) to selectively perturb salient tokens during training. START achieves state-of-the-art results on five DG benchmarks with linear sequence-length complexity and no inference-time overhead, outperforming CNN and ViT baselines as well as recent Mamba-based methods. The approach is supported by extensive experiments, ablations, and theoretical proofs, highlighting its potential for robust DG in vision tasks.

Abstract

Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. Existing DG methods primarily rely on convolutional neural networks (CNNs), which inherently learn texture biases due to their limited receptive fields, making them prone to overfitting source domains. While some works have introduced transformer-based methods (ViTs) for DG to leverage the global receptive field, these methods incur high computational costs due to the quadratic complexity of self-attention. Recently, advanced state space models (SSMs), represented by Mamba, have shown promising results in supervised learning tasks by achieving linear complexity in sequence length during training and fast RNN-like computation during inference. Inspired by this, we investigate the generalization ability of the Mamba model under domain shifts and find that input-dependent matrices within SSMs could accumulate and amplify domain-specific features, thus hindering model generalization. To address this issue, we propose a novel SSM-based architecture with saliency-based token-aware transformation (namely START), which achieves state-of-the-art (SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains. Extensive experiments on five benchmarks demonstrate that START outperforms existing SOTA DG methods with efficient linear complexity. Our code is available at https://github.com/lingeringlight/START.
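
To make the "input-dependent matrices" and the "RNN-like computation" in the abstract concrete, the following is a minimal scalar sketch of a selective state-space recurrence. It is not the authors' implementation: the single-channel state, the fixed choice $A = -1$, and the weights `w_delta`, `w_b`, `w_c` are assumptions made only for illustration. The sketch shows that the step size, input gate, and output gate are all computed from the current token, and that the whole sequence is processed in one pass, so cost grows linearly with sequence length.

```python
import math


def softplus(z):
    """Smooth positive activation, used here to keep the step size positive."""
    return math.log1p(math.exp(z))


def selective_scan(xs, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Scalar selective SSM recurrence (illustrative assumption, A = -1).

    For each token x_t:
        delta_t = softplus(w_delta * x_t)   # input-dependent step size
        a_bar_t = exp(-delta_t)             # discretized state transition
        b_bar_t = delta_t * (w_b * x_t)     # discretized, input-dependent B
        c_t     = w_c * x_t                 # input-dependent C
        h_t     = a_bar_t * h_{t-1} + b_bar_t * x_t
        y_t     = c_t * h_t
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence: O(len(xs))
        delta = softplus(w_delta * x)
        a_bar = math.exp(-delta)
        b_bar = delta * (w_b * x)
        c = w_c * x
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    print(selective_scan([0.5, -1.0, 2.0, 0.1]))
```

Because the hidden state `h` carries forward products of earlier input-dependent terms, anything domain-specific in those terms can persist across later tokens, which is the accumulation effect the abstract describes.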

Paper Structure

This paper contains 17 sections, 23 equations, 4 figures, 14 tables.

Figures (4)

  • Figure 1: Analysis of the input-dependent matrices in SSMs. We investigate domain discrepancy in the input sequence $x$, response sequence $y$, and the input-dependent matrices $\tilde{\Delta}$, $B$, and $C$. The results indicate that the input-dependent matrices can accumulate domain-specific features during the recurrent process, potentially increasing the domain gap. We experiment on PACS li2017deeper with Sketch as the target domain, analyzing the representations from the last block of the VMamba backbone liu2024vmamba.
  • Figure 2: Overall Architecture of the Proposed START Framework. The core of the START framework is the Saliency-driven Token-Aware Transformation, which uses a saliency-driven scheme to localize tokens targeted by input-dependent matrices, subsequently perturbing domain-specific style information within these tokens. We designed two variants: START-M, which uses input-dependent matrices, and START-X, which uses input sequences to compute saliency.
  • Figure 3: Sensitivity to $P_{token}$.
  • Figure 4: Visualization results of our START. The experiments are conducted on the PACS dataset with "Art" as the target domain. We visualize the attention maps of the last layer in the VMamba backbone. For each sample, the first column is the original image, the second column is the attention map of the baseline (i.e., VMamba), and the third and last columns are the attention maps of our START-X and START-M, respectively. Our methods help the model learn more domain-invariant semantic features, e.g., holistic shape structure, than the pure VMamba baseline.
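
The Figure 2 caption describes selecting salient tokens and perturbing style information only within them, with $P_{token}$ (Figure 3) controlling how many tokens are selected. The sketch below illustrates that idea under several assumptions that do not come from the source: saliency is approximated by mean absolute activation per token (a rough stand-in for the START-X variant), "style" is taken to be per-channel mean and standard deviation over the token axis, and the perturbation mixes these statistics with those of a second sample using a hypothetical weight `lam`. It is a toy illustration, not the authors' method.

```python
from statistics import fmean, pstdev


def channel_stats(seq):
    """Per-channel mean and std over the token axis of a tokens-by-channels list."""
    cols = list(zip(*seq))
    mu = [fmean(c) for c in cols]
    sigma = [pstdev(c) + 1e-6 for c in cols]  # epsilon avoids division by zero
    return mu, sigma


def token_saliency(seq):
    """Assumed saliency proxy: mean absolute activation of each token."""
    return [fmean([abs(v) for v in tok]) for tok in seq]


def perturb_salient_tokens(x, x_other, p_token=0.5, lam=0.5):
    """Re-style only the top `p_token` fraction of tokens in `x`.

    Salient tokens are normalized with the statistics of `x` and re-scaled
    with statistics mixed between `x` and `x_other`; other tokens are copied.
    Returns the perturbed sequence and the set of salient token indices.
    """
    n = len(x)
    k = max(1, round(p_token * n))
    scores = token_saliency(x)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    salient = set(ranked[:k])

    mu, sigma = channel_stats(x)
    mu_o, sigma_o = channel_stats(x_other)
    mu_mix = [lam * a + (1 - lam) * b for a, b in zip(mu, mu_o)]
    sigma_mix = [lam * a + (1 - lam) * b for a, b in zip(sigma, sigma_o)]

    out = []
    for i, tok in enumerate(x):
        if i in salient:
            out.append([
                (v - m) / s * sm + mm
                for v, m, s, sm, mm in zip(tok, mu, sigma, sigma_mix, mu_mix)
            ])
        else:
            out.append(list(tok))  # non-salient tokens are left untouched
    return out, salient


if __name__ == "__main__":
    x = [[1.0, 2.0], [10.0, -12.0], [0.5, 0.1], [3.0, 3.0]]
    x_other = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    perturbed, salient = perturb_salient_tokens(x, x_other, p_token=0.5)
    print(sorted(salient), perturbed)
```

A perturbation of this kind would be applied during training only, which is consistent with the TL;DR's statement that START adds no inference-time overhead.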