Learning Transferable Sensor Models via Language-Informed Pretraining

Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell

TL;DR

By repurposing a pretrained decoder-only language model via cross-attention and introducing an elegant, flexible patch-embedder, SLIP supports different temporal resolutions and variable-length input at inference time without additional retraining.

Abstract

Modern sensing systems generate large volumes of unlabeled multivariate time-series data. This abundance of unlabeled data makes self-supervised learning (SSL) a natural approach for learning transferable representations. However, most existing approaches are optimized for reconstruction or forecasting objectives and often fail to capture the semantic structure required for downstream classification and reasoning tasks. While recent sensor-language alignment methods improve semantic generalization through captioning and zero-shot transfer, they are limited to fixed sensor configurations, such as predefined channel sets, signal lengths, or temporal resolutions, which hinders cross-domain applicability. To address these gaps, we introduce SLIP (Sensor Language-Informed Pretraining), an open-source framework for learning language-aligned representations that generalize across diverse sensor setups. SLIP integrates contrastive alignment with sensor-conditioned captioning, facilitating both discriminative understanding and generative reasoning. By repurposing a pretrained decoder-only language model via cross-attention and introducing an elegant, flexible patch-embedder, SLIP supports different temporal resolutions and variable-length input at inference time without additional retraining. Across 11 datasets, SLIP demonstrates superior performance in zero-shot transfer, signal captioning, and question answering. It achieves a 77.14% average linear-probing accuracy, a 5.93% relative improvement over strong baselines, and reaches 64.83% accuracy in sensor-based question answering.
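To make the idea of variable-length input more concrete, the following is a minimal sketch of how a signal of arbitrary length could be cut into fixed-size patches with a padding mask before embedding. It is not SLIP's patch-embedder, whose design the abstract does not specify; the function name `patchify`, the zero-padding strategy, and the patch length of 8 are all assumptions made for illustration.

```python
from typing import List, Tuple


def patchify(signal: List[float], patch_len: int) -> Tuple[List[List[float]], List[List[int]]]:
    """Split a 1-D signal of any length into fixed-size patches (hypothetical helper).

    The final patch is zero-padded to patch_len. The returned mask marks
    real samples with 1 and padding with 0, so a downstream model can
    ignore the padded positions.
    """
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    patches: List[List[float]] = []
    masks: List[List[int]] = []
    for start in range(0, len(signal), patch_len):
        chunk = signal[start:start + patch_len]
        pad = patch_len - len(chunk)
        patches.append(chunk + [0.0] * pad)
        masks.append([1] * len(chunk) + [0] * pad)
    return patches, masks


# Two inputs of different lengths go through the same code path.
short_signal = [0.1 * i for i in range(10)]
long_signal = [0.1 * i for i in range(37)]
p_short, m_short = patchify(short_signal, patch_len=8)
p_long, m_long = patchify(long_signal, patch_len=8)
print(len(p_short), len(p_long))
```

Because the number of patches varies with the input length while each patch keeps the same width, a sequence model that accepts a variable number of tokens can consume either input without retraining. A change in sampling rate would need additional handling that this sketch does not cover.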

Paper Structure (26 sections, 5 equations, 15 figures, 16 tables, 2 algorithms)

This paper contains 26 sections, 5 equations, 15 figures, 16 tables, 2 algorithms.

Figures (15)

  • Figure 1: Illustrative example of the forecasting–classification gap. Chronos-2 achieves accurate forecasting on UCI-HAR with low error (MSE = 0.96), yet its learned representations lead to incorrect activity classification (walking downstairs vs. upstairs). This example shows that SSL-based models optimized for forecasting do not necessarily learn semantic representations that support downstream classification and understanding.
  • Figure 2: Sensor-Language Informed Pretraining (SLIP) Architecture.
  • Figure 3: Overview of how the pretrained SLIP model is used for downstream tasks. The frozen encoders support sensor classification and sensor-text retrieval; after supervised finetuning (SFT), which equips the model with instruction-following ability, SLIP also supports sensor captioning and question answering.
  • Figure 4: Sensor-language representation geometry analysis. Sensor Uniformity (left) and Text Uniformity (middle) quantify embedding dispersion on the unit hypersphere, while Sensor–Text Alignment (right) measures the mean distance between paired sensor and text embeddings. Lower values indicate better performance across all metrics.
  • Figure 5: Example MCQ template used for Gemmani-4b-IT and Gemmani-270M-IT.
  • ...and 10 more figures
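
The uniformity and alignment quantities named in the Figure 4 caption can be sketched as follows. This is a minimal illustration that assumes the commonly used definitions from the contrastive-learning literature (alignment as the mean squared distance between paired embeddings, uniformity as the log of the mean Gaussian kernel over all pairs, with temperature t = 2). The listing does not state the exact formulas used in the paper, so the function names, the squared distance, and the value of t are assumptions.

```python
import math
from itertools import combinations
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(sensor: List[List[float]], text: List[List[float]]) -> float:
    """Mean squared distance between paired embeddings (lower is better)."""
    pairs = list(zip(sensor, text))
    return sum(sq_dist(s, t) for s, t in pairs) / len(pairs)


def uniformity(emb: List[List[float]], t: float = 2.0) -> float:
    """Log of the mean Gaussian kernel over all distinct pairs (lower is better)."""
    pairs = list(combinations(emb, 2))
    mean_kernel = sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs)
    return math.log(mean_kernel)


# Toy data: four embeddings spread evenly on the unit circle,
# versus four copies of the same point (a collapsed representation).
spread = [l2_normalize(v) for v in ([1, 0], [0, 1], [-1, 0], [0, -1])]
collapsed = [l2_normalize([1, 0]) for _ in range(4)]

print("uniformity (spread):   ", uniformity(spread))
print("uniformity (collapsed):", uniformity(collapsed))
print("alignment (identical): ", alignment(spread, spread))
```

Under these definitions a collapsed representation has the worst possible uniformity (zero), while well-spread embeddings give a negative value, which matches the caption's statement that lower values are better.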