CiTrus: Squeezing Extra Performance out of Low-data Bio-signal Transfer Learning
Eloy Geenjaar, Lie Lu
TL;DR
CiTrus introduces a convolution-transformer hybrid for bio-signal transfer learning that is particularly effective in low-data regimes. The model combines a residual CNN encoder with a channel-independent PatchTST transformer and is pre-trained with masked auto-encoding, including frequency-based and multimodal variants, yielding strong transfer performance across diverse bio-signals. A key contribution is a resampling-based transfer technique that aligns the fine-tuning data with the temporal length and sampling rate of the pre-training data, improving cross-dataset generalization. The study shows that the convolution-only part of the model can reach state-of-the-art performance on some low-data tasks, that transformer-based models gain the most from pre-training, and that frequency-based pre-training performs best on average in the lowest and highest data regimes, with multimodal pre-training providing additional gains on several tasks.
Abstract
Transfer learning for bio-signals has recently become an important technique for improving prediction performance on downstream tasks with small bio-signal datasets. Recent works have shown that pre-training a neural network model on a large dataset (e.g., EEG) with a self-supervised task, replacing the self-supervised head with a linear classification head, and fine-tuning the model on different downstream bio-signal datasets (e.g., EMG or ECG) can dramatically improve performance on those datasets. In this paper, we propose a new convolution-transformer hybrid model architecture with masked auto-encoding for low-data bio-signal transfer learning, introduce a frequency-based masked auto-encoding task, employ a more comprehensive evaluation framework, and evaluate how much and when (multimodal) pre-training improves fine-tuning performance. We also introduce a dramatically better-performing method of aligning a downstream dataset that has a different temporal length and sampling rate with the original pre-training dataset. Our findings indicate that the convolution-only part of our hybrid model can achieve state-of-the-art performance on some low-data downstream tasks, and that the full model often improves performance further. We also find that pre-training especially improves the downstream performance of transformer-based models, that multimodal pre-training often increases those gains further, and that our frequency-based pre-training performs best on average in the lowest and highest data regimes.
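To make the alignment idea concrete, the following is a minimal sketch of resampling a downstream window so that its number of samples matches the window length used during pre-training. The abstract does not specify the exact procedure, so this is an assumption: the function names `resample_linear` and `align_to_pretraining` are hypothetical, linear interpolation is a stand-in for whatever resampler is actually used, and the window is assumed to be a list of channels, each a list of samples.

```python
from typing import List, Sequence


def resample_linear(signal: Sequence[float], target_len: int) -> List[float]:
    """Linearly interpolate a 1-D signal onto target_len evenly spaced points.

    Illustrative stand-in only; not necessarily the resampler used in the paper.
    """
    n = len(signal)
    if n == 0 or target_len <= 0:
        raise ValueError("signal must be non-empty and target_len positive")
    if n == 1 or target_len == 1:
        return [float(signal[0])] * target_len
    # Map each output index to a fractional position in the input signal.
    scale = (n - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = min(int(pos), n - 2)  # left neighbour, clamped so lo + 1 is valid
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[lo + 1] * frac)
    return out


def align_to_pretraining(
    window: Sequence[Sequence[float]], target_len: int
) -> List[List[float]]:
    """Resample each channel of a (channels x time) window to target_len samples.

    target_len is assumed to be the window length seen during pre-training.
    """
    return [resample_linear(channel, target_len) for channel in window]


if __name__ == "__main__":
    # Hypothetical 2-channel downstream window with 3 samples per channel,
    # stretched to a hypothetical pre-training window length of 5 samples.
    downstream = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
    print(align_to_pretraining(downstream, target_len=5))
```

In practice a band-limited resampler such as `scipy.signal.resample` would likely be preferable to linear interpolation when downsampling, since it reduces aliasing; the pure-Python version is used here only to keep the sketch dependency-free.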
