Model Compression Method for S4 with Diagonal State Space Layers using Balanced Truncation

Haruka Ezoe, Kazuhiro Sato

TL;DR

The paper addresses running S4 models with DSS layers on edge devices. It introduces a model compression method based on balanced truncation that reduces the state dimension of the DSS layers in a pre-trained model and uses the reduced parameters to initialize the main training. Models trained this way reach higher accuracy on Long Range Arena tasks, with fewer parameters, than models initialized via Skew-HiPPO. The results also show a positive correlation between the accuracy of the original pre-trained model and the accuracy of the model trained from its compressed parameters. The findings suggest that state-space model reduction can preserve, and in some cases improve, long-range sequence modeling in resource-constrained settings. Potential extensions include combining this compression with other methods and applying it to physics-informed neural networks and broader edge applications.

Abstract

To implement deep learning models on edge devices, model compression methods have been widely recognized as useful. However, it remains unclear which model compression methods are effective for Structured State Space Sequence (S4) models incorporating Diagonal State Space (DSS) layers, which are tailored for processing long-sequence data. In this paper, we propose applying balanced truncation, a prevalent model reduction technique in control theory, to the DSS layers of a pre-trained S4 model as a novel model compression method. Moreover, we propose using the reduced model parameters obtained by balanced truncation as the initial parameters of S4 models with DSS layers during the main training process. Numerical experiments demonstrate that our trained models combined with balanced truncation surpass conventionally trained models with Skew-HiPPO initialization in accuracy, even with fewer parameters. Furthermore, we observe a positive correlation: higher accuracy in the original model consistently leads to higher accuracy in models trained using our model compression method, suggesting that our approach effectively leverages the strengths of the original model.
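As an illustration of the balanced truncation step named in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single-input, single-output continuous-time system $\dot{x} = Ax + Bu$, $y = Cx$ with diagonal, stable $A = \mathrm{diag}(\lambda_1,\ldots,\lambda_N)$; the paper's exact formulation (for example, whether it works with the discretized system) may differ. The function names `diagonal_gramians`, `psd_factor`, and `balanced_truncation`, and the test values of `lam`, `B`, `C`, are hypothetical. For diagonal $A$ the Gramians have a closed form, so no Lyapunov solver is needed.

```python
import numpy as np

def diagonal_gramians(lam, B, C):
    """Closed-form Gramians for x' = diag(lam) x + B u, y = C x, Re(lam) < 0."""
    # Controllability Gramian P solves A P + P A^H + B B^H = 0.
    P = -np.outer(B, B.conj()) / (lam[:, None] + lam.conj()[None, :])
    # Observability Gramian Q solves A^H Q + Q A + C^H C = 0.
    Q = -np.outer(C.conj(), C) / (lam.conj()[:, None] + lam[None, :])
    return P, Q

def psd_factor(M):
    """Return S with S @ S^H = M for a Hermitian positive semidefinite M."""
    M = (M + M.conj().T) / 2                  # symmetrize against round-off
    w, V = np.linalg.eigh(M)
    return V * np.sqrt(np.clip(w, 0.0, None))

def balanced_truncation(lam, B, C, r):
    """Square-root balanced truncation to order r (assumes hsv[r-1] > 0)."""
    P, Q = diagonal_gramians(lam, B, C)
    S, R = psd_factor(P), psd_factor(Q)
    # Singular values of R^H S are the Hankel singular values (descending).
    U, hsv, Vh = np.linalg.svd(R.conj().T @ S)
    scale = hsv[:r] ** -0.5
    T = (S @ Vh.conj().T[:, :r]) * scale                      # N x r
    Tinv = (scale[:, None] * U[:, :r].conj().T) @ R.conj().T  # r x N
    A_r = Tinv @ (lam[:, None] * T)   # reduced A, generally not diagonal
    B_r = Tinv @ B
    C_r = C @ T
    return A_r, B_r, C_r, hsv

# Hypothetical example: reduce an order-8 system to order 4.
rng = np.random.default_rng(0)
N, r = 8, 4
lam = -0.5 + 1j * np.arange(N)
B = rng.standard_normal(N) + 1j * rng.standard_normal(N)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
A_r, B_r, C_r, hsv = balanced_truncation(lam, B, C, r)

# Compare frequency responses against the classical bound 2 * (sum of discarded HSVs).
bound = 2.0 * hsv[r:].sum()
worst = 0.0
for omega in np.linspace(-5.0, 15.0, 201):
    s = 1j * omega
    G_full = np.sum(C * B / (s - lam))
    G_red = C_r @ np.linalg.solve(s * np.eye(r) - A_r, B_r)
    worst = max(worst, abs(G_full - G_red))
print(f"max sampled error {worst:.3e}, bound {bound:.3e}")
```

The reduced matrix `A_r` is dense in general. Recovering diagonal parameters for a DSS layer would need a further step such as an eigendecomposition of `A_r`; that step is an assumption here and is not shown.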

Paper Structure

This paper contains 22 sections, 1 theorem, 43 equations, 7 figures, and 9 tables.

Key Result

Proposition 1

Suppose that the parameters $A=\mathrm{diag}(\lambda_1,\ldots,\lambda_N)$, $B$, $C$, $\Delta$ of the DSS model (eq:d-SSM) are given, and define $K := \bar{K}_{\Delta,L}(A,B,C) \in \mathbb{C}^{L}$. Then there exist $w,\tilde{w}\in \mathbb{C}^{1 \times N}$ satisfying the following equations: [...]
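To make the kernel $K = \bar{K}_{\Delta,L}(A,B,C)$ in the proposition concrete, here is a small NumPy sketch. It assumes the zero-order-hold discretization $\bar{A} = e^{A\Delta}$, $\bar{B} = (\bar{A} - I)A^{-1}B$ and the kernel definition $K_k = C\bar{A}^k\bar{B}$ for $k = 0,\ldots,L-1$, as commonly used for DSS; neither is stated in the text above, so treat both as assumptions. The function names and the sample parameter values are hypothetical. The second function computes the same kernel as the impulse response of the discrete recurrence, which serves as a cross-check.

```python
import numpy as np

def dss_kernel(lam, B, C, delta, L):
    """Kernel (C Abar^k Bbar) for k = 0..L-1 with A = diag(lam), lam nonzero."""
    Abar = np.exp(lam * delta)                # diagonal of exp(A * delta)
    Bbar = (Abar - 1.0) / lam * B             # (Abar - I) A^{-1} B, elementwise
    powers = Abar[None, :] ** np.arange(L)[:, None]   # L x N table of Abar_i^k
    return powers @ (C * Bbar)

def kernel_by_recurrence(lam, B, C, delta, L):
    """Same kernel, obtained by stepping x_k = Abar x_{k-1} + Bbar u_k on a unit impulse."""
    Abar = np.exp(lam * delta)
    Bbar = (Abar - 1.0) / lam * B
    x = Bbar.copy()                           # state right after the impulse at k = 0
    out = []
    for _ in range(L):
        out.append(np.sum(C * x))             # y_k = C x_k
        x = Abar * x                          # no further input after k = 0
    return np.array(out)

# Hypothetical parameters for a quick consistency check.
rng = np.random.default_rng(1)
N, L, delta = 4, 16, 0.1
lam = -0.5 + 1j * np.arange(1, N + 1)
B = rng.standard_normal(N) + 1j * rng.standard_normal(N)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
K = dss_kernel(lam, B, C, delta, L)
K_rec = kernel_by_recurrence(lam, B, C, delta, L)
```

Because $A$ is diagonal, each kernel entry is a weighted sum of powers of the scalars $e^{\lambda_i\Delta}$, which is why the whole kernel reduces to one Vandermonde-style matrix product.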

Figures (7)

  • Figure 1: Edge Intelligence (EI). In EI, data gathered from various devices is not processed entirely in the cloud but rather locally on each device. These devices, like sensors in industrial settings, face limitations that make it difficult to deploy large-scale deep learning models typically trained in the cloud, due to constraints such as computing resources and power consumption.
  • Figure 2: Deep learning model with DSS layers. This represents the overall architecture of the deep learning model used in this study, with the intermediate DSS layer being the most critical component.
  • Figure 3: DSS layer, which consists of $H$ DSS models, nonlinear connection blocks, and a linear combination block.
  • Figure 4: Proposed method, which consists of Pre-Training, DSS Reduction, Parameter Extraction, and Main Training. The DSS Reduction step uses the balanced truncation method.
  • Figure 5: Hankel singular values obtained from each SSM for DSS${}_\text{SOFTMAX}$ with $N=128$. These SSMs were part of the Pre-Trained model initialized using the Skew-HiPPO with $N=128$, as shown in Table \ref{table:softmax}. Although the hidden size was set to $16$, only the cases $H=1$, $2$, $3$, and $4$ are presented because the results for $H=5, 6, \ldots, 16$ were almost identical.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Proposition 1