Role of Mixup in Topological Persistence Based Knowledge Distillation for Wearable Sensor Data
Eun Som Jeon, Hongjun Choi, Matthew P. Buman, Pavan Turaga
TL;DR
The paper tackles the challenge of deploying topological features from wearable sensor data by studying the role of mixup in knowledge distillation with multiple teachers (time-series and persistence images). It analyzes single- and multi-teacher KD setups and various mixup strategies, including temperature-driven smoothing and partial mixup, to distill a time-series–only student. Key findings show that annealing with multiple teachers typically yields the best performance, mixup provides beneficial smoothness but excessive smoothing can hurt, and different teachers may require different mixup hyperparameters; partial mixup can mitigate over-smoothing. The work provides practical guidance for efficient, topologically informed KD in wearable sensing and offers a framework potentially extendable to other multimodal domains and computer vision tasks.
Abstract
The analysis of wearable sensor data has enabled successes in many applications. To represent high-sampling-rate time-series in sufficient detail, topological data analysis (TDA) has been considered, and TDA has been found to complement other time-series features. However, extracting topological features through TDA is time-consuming and computationally demanding, which makes it difficult to deploy topological knowledge in various applications. To tackle this problem, knowledge distillation (KD) can be adopted: a technique for model compression and transfer learning that produces a smaller model by transferring knowledge from a larger network. By leveraging multiple teachers in KD, both time-series and topological features can be transferred, yielding a superior student that uses only time-series data. Separately, mixup is a popular and robust data augmentation technique for improving model performance during training. Mixup and KD employ similar learning strategies: in KD, the student learns from the smoothed distribution generated by the teacher, while mixup creates smoothed labels by blending two labels. This shared smoothness is the link between the two methods. In this paper, we analyze the role of mixup in KD with time-series as well as topological persistence, employing multiple teachers. We present a comprehensive analysis of various methods in KD and mixup on wearable sensor data.
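
To make the two sources of smoothness concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes the standard temperature-softened KD loss and the standard mixup formulation with a Beta-distributed blending coefficient. All function names, the temperature, the loss weights, and the random stand-in logits are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis; larger T gives a smoother output."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mixup(x, y_onehot, alpha, rng):
    """Blend each sample (and its label) with a random partner from the same batch."""
    lam = rng.beta(alpha, alpha)           # blending coefficient in [0, 1]
    perm = rng.permutation(len(x))         # partner indices
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]  # smoothed labels
    return x_mix, y_mix, lam


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return temperature ** 2 * kl.mean()


def multi_teacher_loss(student_logits, y_soft, ts_teacher_logits, pi_teacher_logits,
                       temperature=4.0, kd_weight=0.9, teacher_mix=0.5):
    """Cross-entropy on (possibly mixed) labels plus a weighted sum of two KD terms.

    ts_teacher_logits: logits from a teacher trained on time-series (assumed given).
    pi_teacher_logits: logits from a teacher trained on persistence images (assumed given).
    The default hyperparameters are illustrative assumptions, not reported values.
    """
    ce = -np.sum(y_soft * np.log(softmax(student_logits) + 1e-12), axis=-1).mean()
    kd = (teacher_mix * kd_loss(student_logits, ts_teacher_logits, temperature)
          + (1.0 - teacher_mix) * kd_loss(student_logits, pi_teacher_logits, temperature))
    return (1.0 - kd_weight) * ce + kd_weight * kd


# Toy usage: random arrays stand in for sensor windows and network outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 100))                       # 8 windows, 100 samples each
y = np.eye(4)[rng.integers(0, 4, size=8)]           # one-hot labels, 4 classes
x_mix, y_mix, lam = mixup(x, y, alpha=0.2, rng=rng)
student = rng.normal(size=(8, 4))                   # stand-in student logits
teacher_ts = rng.normal(size=(8, 4))                # stand-in time-series teacher logits
teacher_pi = rng.normal(size=(8, 4))                # stand-in persistence-image teacher logits
print(multi_teacher_loss(student, y_mix, teacher_ts, teacher_pi))
```

In this sketch the temperature smooths the teacher distributions while `alpha` controls how strongly mixup smooths the labels, so both knobs act on the same training signal. This is one way to see why stacking the two can over-smooth.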
