Jointly Modeling Spatio-Temporal Features of Tactile Signals for Action Classification
Jimmy Lin, Junkai Li, Jiasi Gao, Weizhi Ma, Yang Liu
TL;DR
This work addresses action classification from tactile signals by introducing STAT, a transformer architecture that jointly models spatial and temporal features through dedicated embeddings and a temporal pretraining task. The model converts tactile data into tubelets, enriches them with spatial and temporal cues, and processes them with multi-layer transformers, using a CLS token for classification. Pretraining combines masked tubelet reconstruction with a temporal order discrimination objective, which together improve feature learning for spatio-temporal tactile signals. Empirical results on a public tactile dataset show STAT outperforms state-of-the-art baselines across ACC@1, ACC@3, and Macro-F1, demonstrating the practical value of explicitly incorporating spatial translation variance in tactile sensing for robust action recognition.
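The following minimal NumPy sketch makes the input pipeline described above concrete. It cuts a tactile sequence into tubelets, projects each one to a token, adds spatial and temporal embeddings, and prepends a CLS token. It is not the authors' implementation. The function name `tubelet_tokens`, the tubelet size (`t=4` frames by `p=8` sensor cells), the embedding width `dim=32`, and the 32 x 32 sensor grid in the usage lines are illustrative assumptions. The randomly initialized matrices stand in for parameters that a real model would learn.

```python
import numpy as np

def tubelet_tokens(frames, t=4, p=8, dim=32, seed=0):
    """Split a (T, H, W) tactile sequence into non-overlapping t x p x p tubelets,
    project each to `dim`, add spatial and temporal embeddings, prepend a CLS token.

    All parameters here are hypothetical and randomly initialized for illustration.
    """
    rng = np.random.default_rng(seed)
    T, H, W = frames.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "sizes must divide evenly"
    nt, nh, nw = T // t, H // p, W // p

    # Rearrange into (time slot, sensor region, flattened tubelet values).
    x = frames.reshape(nt, t, nh, p, nw, p).transpose(0, 2, 4, 1, 3, 5)
    x = x.reshape(nt, nh * nw, t * p * p)

    # Stand-ins for learned parameters.
    proj = rng.normal(0, 0.02, size=(t * p * p, dim))
    spatial_emb = rng.normal(0, 0.02, size=(nh * nw, dim))  # one vector per sensor region
    temporal_emb = rng.normal(0, 0.02, size=(nt, dim))      # one vector per time slot
    cls = rng.normal(0, 0.02, size=(1, dim))

    tokens = x @ proj                                        # (nt, nh*nw, dim)
    tokens = tokens + spatial_emb[None, :, :] + temporal_emb[:, None, :]
    tokens = tokens.reshape(nt * nh * nw, dim)
    return np.concatenate([cls, tokens], axis=0)             # (1 + N, dim)

# Usage: 16 frames of a 32 x 32 pressure map (illustrative sizes).
frames = np.random.rand(16, 32, 32)
print(tubelet_tokens(frames).shape)  # (65, 32)
```

The spatial embedding is indexed by absolute sensor region rather than shared across positions. As a result, two tubelets with identical readings at different locations receive different tokens. This is one simple way to encode the spatial translation variance mentioned above. The resulting token sequence would then be fed to a multi-layer transformer encoder, with the CLS output used for classification.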
Abstract
Tactile signals collected by wearable electronics are essential for modeling and understanding human behavior. One of the main applications of tactile signals is action classification, especially in healthcare and robotics. However, existing tactile classification methods fail to capture the spatial and temporal features of tactile signals simultaneously, which leads to sub-optimal performance. In this paper, we design the Spatio-Temporal Aware tactility Transformer (STAT) to utilize continuous tactile signals for action classification. We propose spatial and temporal embeddings, along with a new temporal pretraining task, to enhance the transformer's ability to model the spatio-temporal features of tactile signals. Specifically, the temporal pretraining task requires the model to distinguish the temporal order of tubelet inputs, so that temporal properties are modeled explicitly. Experimental results on a public action classification dataset demonstrate that our model outperforms state-of-the-art methods on all metrics.
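The temporal-order discrimination task can be sketched as a data-construction step. The abstract does not specify how the order is corrupted. This sketch assumes, hypothetically, that with probability `p_shuffle` two randomly chosen time slots of tubelets are swapped. The model would then predict from its CLS output whether the input keeps the original order (label 1) or not (label 0). The function `make_order_example` and its parameters are illustrative, not the paper's API.

```python
import numpy as np

def make_order_example(tubelets, rng, p_shuffle=0.5):
    """Build one (input, label) pair for a temporal-order discrimination task.

    tubelets: array of shape (n_time, n_space, d), tubelet features in time order.
    Returns the (possibly reordered) array and a label:
      1 if the original time order is kept,
      0 if two distinct time slots were swapped (hypothetical corruption scheme).
    """
    n_time = tubelets.shape[0]
    if n_time < 2 or rng.random() >= p_shuffle:
        return tubelets.copy(), 1
    # Pick two distinct time slots and swap them.
    i, j = rng.choice(n_time, size=2, replace=False)
    order = np.arange(n_time)
    order[[i, j]] = order[[j, i]]
    return tubelets[order], 0

# Usage: 4 time slots, 2 sensor regions, 3 features per tubelet.
rng = np.random.default_rng(0)
x = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
y, label = make_order_example(x, rng)
```

Swapping two distinct slots guarantees that the index order changes. However, if two time slots happen to hold identical readings, the swapped input is indistinguishable from the original, so a real pipeline might filter such cases.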
