IMG2IMU: Translating Knowledge from Large-Scale Images to IMU Sensing Applications
Hyungjun Yoon, Hyeongheon Cha, Hoang C. Nguyen, Taesik Gong, Sung-Ju Lee
TL;DR
This work tackles data scarcity in IMU sensing by importing knowledge from large-scale vision datasets. It converts triaxial IMU signals into spectrogram images and pre-trains a vision model with sensor-aware contrastive learning, notably MoCo with a tailored InfoNCE loss. The resulting IMG2IMU framework improves macro-F1 by an average of about 9.6 percentage points over baselines pre-trained on sensor data, across diverse IMU tasks with limited labeled data, while keeping on-device latency feasible. Through Grad-CAM visualizations and ablation analyses, the study also shows how augmentation design and spectrogram-based representations matter for cross-modal transfer from vision to sensing.
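To make the contrastive objective mentioned above concrete, here is a minimal NumPy sketch of the standard MoCo-style InfoNCE loss. It is not the tailored variant used in IMG2IMU, whose details are not given in this summary; the function name `info_nce_loss`, the argument names, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import numpy as np

def _l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(queries, keys, negatives, temperature=0.07):
    """Standard InfoNCE loss (illustrative, not the paper's tailored loss).

    queries:   (N, D) embeddings of one augmented view
    keys:      (N, D) embeddings of the other view (row i is the positive for query i)
    negatives: (K, D) embeddings used as negatives (e.g. a MoCo-style queue)
    """
    q = _l2_normalize(np.asarray(queries, dtype=np.float64))
    k = _l2_normalize(np.asarray(keys, dtype=np.float64))
    neg = _l2_normalize(np.asarray(negatives, dtype=np.float64))

    l_pos = np.sum(q * k, axis=1, keepdims=True)   # (N, 1) positive similarities
    l_neg = q @ neg.T                              # (N, K) negative similarities
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature

    # Numerically stable log-softmax; the positive always sits in column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())
```

The loss is small when each query is closer to its own key than to any negative, which is the property a sensor-aware augmentation set would aim to shape during pre-training.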
Abstract
Pre-trained representations acquired via self-supervised learning can achieve high accuracy even on tasks with little training data. Unlike in the vision and natural language processing domains, pre-training for IMU-based applications is challenging, as there are few public datasets of sufficient size and diversity to learn generalizable representations. To overcome this problem, we propose IMG2IMU, which adapts pre-trained representations from large-scale images to diverse IMU sensing tasks. We convert the sensor data into visually interpretable spectrograms so that the model can utilize the knowledge gained from vision. We further present a sensor-aware pre-training method for images that enables models to acquire knowledge that is particularly impactful for IMU sensing applications. This involves contrastive learning with an augmentation set customized for the properties of sensor data. Our evaluation on four different IMU sensing tasks shows that IMG2IMU outperforms the baselines pre-trained on sensor data by an average of 9.6%p in F1-score, illustrating that vision knowledge can be usefully incorporated into IMU sensing applications where only limited training data is available.
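The spectrogram conversion described in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' pipeline: mapping the three IMU axes to the three image channels, the Hann window, the FFT size, the hop length, the log scaling, and the name `imu_to_spectrogram_image` are all choices made here for the example.

```python
import numpy as np

def imu_to_spectrogram_image(signal, n_fft=64, hop=16):
    """Convert a (T, 3) triaxial IMU window into a 3-channel spectrogram image.

    Returns an array of shape (n_fft // 2 + 1, n_frames, 3) scaled to [0, 1],
    with one channel per sensor axis (an assumed mapping, for illustration).
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.ndim != 2 or signal.shape[1] != 3:
        raise ValueError("expected an array of shape (T, 3)")
    if signal.shape[0] < n_fft:
        raise ValueError("signal is shorter than one FFT window")

    window = np.hanning(n_fft)
    n_frames = 1 + (signal.shape[0] - n_fft) // hop

    channels = []
    for axis in range(3):
        # Slice overlapping frames and taper each one with the window.
        frames = np.stack([
            signal[i * hop : i * hop + n_fft, axis] * window
            for i in range(n_frames)
        ])
        magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_freq)
        channels.append(np.log1p(magnitude).T)           # (n_freq, n_frames)

    image = np.stack(channels, axis=-1)

    # Scale jointly across axes so relative energy between axes is preserved.
    lo, hi = image.min(), image.max()
    if hi <= lo:
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)
```

For a 256-sample window with the defaults above, the result has shape `(33, 13, 3)`. It would still need resizing to the input resolution of whichever vision backbone is used.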
