Model Compression Method for S4 with Diagonal State Space Layers using Balanced Truncation
Haruka Ezoe, Kazuhiro Sato
TL;DR
The paper aims to make S4 models with DSS layers deployable on edge devices. It introduces a model compression method based on balanced truncation that reduces the state dimension of each DSS layer in a pre-trained model, then uses the reduced parameters to initialize the main training. Models trained this way reach higher accuracy on Long Range Arena tasks, with fewer parameters, than models initialized with Skew-HiPPO. The experiments also show a positive correlation between the accuracy of the original pre-trained model and the accuracy of the model trained after compression. The findings suggest that state-space model reduction can preserve and even improve long-range sequence modeling in resource-constrained settings, which is relevant to practical deployment and to future theory for SSM-based architectures. Potential extensions include combining this compression with other methods and applying it to physics-informed neural networks and broader edge applications.
Abstract
Model compression methods are widely recognized as useful for implementing deep learning models on edge devices. However, it remains unclear which model compression methods are effective for Structured State Space Sequence (S4) models incorporating Diagonal State Space (DSS) layers, which are tailored for processing long-sequence data. In this paper, we propose a novel model compression method that applies balanced truncation, a prevalent model reduction technique in control theory, to the DSS layers of a pre-trained S4 model. Moreover, we propose using the reduced model parameters obtained by balanced truncation as the initial parameters of S4 models with DSS layers during the main training process. Numerical experiments demonstrate that our models trained in combination with balanced truncation surpass conventionally trained models with Skew-HiPPO initialization in accuracy, even with fewer parameters. Furthermore, we observe a positive correlation: higher accuracy in the original model consistently leads to higher accuracy in models trained using our model compression method, suggesting that our approach effectively leverages the strengths of the original model.
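To make the idea of balanced truncation on a diagonal state space more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes a continuous-time system x' = diag(lam) x + B u, y = C x with Re(lam) < 0, positive-definite Gramians, and a single input and output in the example. The names `diagonal_gramians`, `balanced_truncation`, `lam`, `B`, `C`, and `r` are hypothetical and chosen only for illustration.

```python
import numpy as np

def diagonal_gramians(lam, B, C):
    """Closed-form Gramians for x' = diag(lam) x + B u, y = C x (assumes Re(lam) < 0)."""
    # Controllability: A P + P A^H + B B^H = 0 gives P_ij = -(B B^H)_ij / (lam_i + conj(lam_j))
    P = -(B @ B.conj().T) / (lam[:, None] + lam.conj()[None, :])
    # Observability: A^H Q + Q A + C^H C = 0 gives Q_ij = -(C^H C)_ij / (conj(lam_i) + lam_j)
    Q = -(C.conj().T @ C) / (lam.conj()[:, None] + lam[None, :])
    return P, Q

def balanced_truncation(lam, B, C, r):
    """Square-root balanced truncation to state dimension r (illustrative sketch)."""
    P, Q = diagonal_gramians(lam, B, C)
    R = np.linalg.cholesky(P)            # P = R R^H
    L = np.linalg.cholesky(Q)            # Q = L L^H
    # Singular values of L^H R are the Hankel singular values (sorted descending).
    U, hsv, Vh = np.linalg.svd(L.conj().T @ R)
    S = np.diag(hsv ** -0.5)
    T = R @ Vh.conj().T @ S              # balancing transformation
    Tinv = S @ U.conj().T @ L.conj().T   # its inverse
    A_bal = Tinv @ np.diag(lam) @ T
    B_bal = Tinv @ B
    C_bal = C @ T
    # Keep the r states with the largest Hankel singular values.
    return A_bal[:r, :r], B_bal[:r], C_bal[:, :r], hsv

# Toy example (assumed values, not taken from the paper).
lam = np.array([-1.0, -2.0, -3.0, -4.0])
B = np.ones((4, 1))
C = np.ones((1, 4))
A_r, B_r, C_r, hsv = balanced_truncation(lam, B, C, r=2)
```

The reduced matrix `A_r` is generally dense rather than diagonal. Returning to a diagonal layer would need an extra step, for example an eigendecomposition of `A_r`; that step is an assumption here and is not described in the abstract. The discarded Hankel singular values `hsv[r:]` give the standard balanced truncation error bound (twice their sum) on the input-output mismatch.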
