Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings

Jihyeon Seong; Jungmin Kim; Jaesik Choi

TL;DR

This paper tackles the challenge that no single temporal pooling method universally captures the temporal structure of time series data. It proposes SoM-TP, an attention-based selection mechanism over multiple temporal poolings (GTP, STP, DTP) within a single classifier, augmented by a Diverse Perspective Learning Network (DPLN) and a perspective loss to regularize learning across pooling perspectives. The approach enables non-iterative, batch-wise pooling selection inspired by Multiple Choice Learning, and is complemented by LRP-based analysis to demonstrate diverse perspective learning. Empirical results on extensive UCR/UEA benchmarks show SoM-TP surpasses traditional pooling methods and many state-of-the-art TSC models, with robust performance and informative attribution patterns, underscoring the value of pooling-level ensemble in time series classification.
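To make the idea of differing pooling perspectives concrete, the following minimal sketch pools a toy feature map in plain Python. It is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. GTP is assumed to pool over the whole time axis and STP over equal-length segments. `segment_pooling` with hand-picked boundaries only stands in for the output stage of DTP, whose learned segmentation is not described in this summary.

```python
from typing import Callable, List, Sequence

FeatureMap = List[List[float]]  # layout: [channel][time step]


def _reducer(mode: str) -> Callable[[Sequence[float]], float]:
    """Return max or mean as the per-segment reduction."""
    if mode == "max":
        return max
    if mode == "avg":
        return lambda xs: sum(xs) / len(xs)
    raise ValueError(f"unknown mode: {mode}")


def segment_pooling(features: FeatureMap, bounds: Sequence[int],
                    mode: str = "max") -> FeatureMap:
    """Pool each channel inside the segments [bounds[i], bounds[i+1])."""
    reduce = _reducer(mode)
    if any(lo >= hi for lo, hi in zip(bounds, bounds[1:])):
        raise ValueError("segment boundaries must be strictly increasing")
    return [[reduce(ch[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]
            for ch in features]


def global_temporal_pooling(features: FeatureMap, mode: str = "max") -> FeatureMap:
    """Assumed GTP: one segment covering the whole time axis."""
    return segment_pooling(features, [0, len(features[0])], mode)


def static_temporal_pooling(features: FeatureMap, n_segments: int,
                            mode: str = "max") -> FeatureMap:
    """Assumed STP: n_segments near-equal segments, fixed in advance."""
    length = len(features[0])
    bounds = [i * length // n_segments for i in range(n_segments + 1)]
    return segment_pooling(features, bounds, mode)


# Toy single-channel feature map with six time steps.
toy = [[1.0, 5.0, 2.0, 0.0, 3.0, 4.0]]
gtp_out = global_temporal_pooling(toy)        # whole series at once
stp_out = static_temporal_pooling(toy, 3)     # three equal segments
dtp_like_out = segment_pooling(toy, [0, 1, 5, 6])  # hand-picked boundaries
```

On the same input, the three calls return different summaries, which is the sense in which each pooling is treated as a separate perspective.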

Abstract

In Time Series Classification (TSC), temporal pooling methods that consider sequential information have been proposed. However, we found that each temporal pooling has a distinct mechanism and can perform better or worse depending on the time series data. We term this fixed pooling mechanism a single perspective of temporal poolings. In this paper, we propose a novel temporal pooling method with diverse perspective learning: Selection over Multiple Temporal Poolings (SoM-TP). SoM-TP uses attention to dynamically select the optimal temporal pooling among multiple methods for each batch of data. The dynamic pooling selection is motivated by the ensemble concept of Multiple Choice Learning (MCL), which selects the best among multiple outputs. The pooling selection by SoM-TP's attention enables a non-iterative pooling ensemble within a single classifier. Additionally, we define a perspective loss and a Diverse Perspective Learning Network (DPLN). The loss works as a regularizer to reflect all the pooling perspectives from DPLN. Our perspective analysis using Layer-wise Relevance Propagation (LRP) reveals the limitation of a single perspective and ultimately demonstrates diverse perspective learning of SoM-TP. We also show, through extensive experiments on the UCR/UEA repositories, that SoM-TP outperforms CNN models based on other temporal poolings as well as state-of-the-art TSC models.
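The attention-based selection described above can be sketched as follows, using the symbols from the Figure 2 caption. This is a simplified illustration, not the authors' code. The function names are hypothetical, `a0` holds one weight per pooling as a stand-in for $\mathbf{A_0}$, the convolutional layer $\phi_0$ is replaced by a plain mean followed by a softmax, and forming $\mathbf{E}$ as an attention-weighted concatenation is an assumption.

```python
import math
from typing import List, Tuple

Vector = List[float]
Batch = List[Vector]  # layout: [sample][feature]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pooling(pooled: List[Batch], a0: Vector) -> Tuple[Vector, int, Batch, Batch]:
    """Pick one pooling for the whole batch and build the ensembled input.

    pooled[k][b] is the flattened output of pooling k for sample b;
    all poolings are assumed to produce vectors of the same length.
    """
    n_poolings = len(pooled)
    n_samples = len(pooled[0])

    # M = P_bar * A_0, then a stand-in for phi_0: one mean logit per pooling.
    logits = []
    for k in range(n_poolings):
        weighted = [a0[k] * v for sample in pooled[k] for v in sample]
        logits.append(sum(weighted) / len(weighted))

    attention = softmax(logits)                      # attention score A
    selected = max(range(n_poolings), key=lambda k: attention[k])

    cls_input = pooled[selected]                     # CLS network: selected pooling only
    dpl_input = [                                    # DPLN: assumed weighted concatenation E
        [attention[k] * v for k in range(n_poolings) for v in pooled[k][b]]
        for b in range(n_samples)
    ]
    return attention, selected, cls_input, dpl_input


# Three poolings, a batch of two samples, two features each.
pooled = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
attention, selected, cls_input, dpl_input = select_pooling(pooled, [1.0, 1.0, 1.0])
```

Because the selection is a single argmax per batch, only one forward pass is needed, which matches the non-iterative ensemble described in the abstract. Changing `a0`, which would be learned in practice, changes which pooling wins.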

Paper Structure

This paper contains 43 sections, 10 equations, 9 figures, 8 tables, 2 algorithms.

Introduction
Background
Different Perspectives between Temporal Poolings
Convolutional Neural Network in TSC
Global Temporal Pooling
Static Temporal Pooling
Dynamic Temporal Pooling
Limitation of Single Perspective
Multiple Choice Learning for Deep Temporal Pooling
Selection over Multiple Temporal Pooling
SoM-TP Architecture and Selection Ensemble
DPL Attention
Diverse Perspective Learning Network and Perspective Loss
Perspective Loss
Optimization
...and 28 more sections

Figures (9)

Figure 1: Perspectives of Temporal Poolings. Depending on segmentation types, each temporal pooling generates different pooling outputs and has different perspectives.
Figure 2: SoM-TP Architecture. Diverse Perspective Learning based on selection-ensemble is achieved as follows. The aggregated output of all poolings, $\bar{\mathbf{P}}$, is passed to the attention block to calculate the attention score $\mathbf{A}$. In the attention block, a weighted pooling output $\mathbf{M}$ is formed by multiplying $\bar{\mathbf{P}}$ by a learnable weight vector $\mathbf{A_0}$. After $\mathbf{M}$ passes through the convolutional layer $\phi_0$, the attention score $\mathbf{A}$ is obtained as an encoded weight vector. The index of the highest attention score (here, index 3) determines which pooling is used for the CLS network. The parameters are then updated as follows: 1) DPLN uses the ensembled vector $\mathbf{E}$, whereas the CLS network uses only the selected pooling output (here, $\mathbf{p_s}$); 2) the two networks predict $y_{CLS}$ and $y_{DPL}$, respectively, and $y_{DPL}$ is used in the perspective loss as a regularizer; 3) with these two outputs, the model is optimized with diverse perspectives while selecting the proper pooling method for each batch.
Figure 3: Dynamic Pooling Selection in SoM-TP. This figure plots the dynamic pooling selection of FCN SoM-TP MAX on three datasets from the UCR repository: ArrowHead, Chinatown, and ACSF1.
Figure 4: Comparison of LRP Input Attribution on Single vs Diverse Perspective Learning. The figure shows LRP attribution results for FaceAll, FiftyWords, and MoteStrain datasets in the UCR repository, ordered in respective rows. Pooling choice significantly affects accuracy and attributions, reflecting different perspectives. Redder areas of time series indicate higher attribution, aligning with LRP's conservation rule of summing to 1. Blue circles denote well-captured regions, while red circles suggest dispersed focus or inadequate capture. Given the absence of a ground truth concept for input attributes in TSC, we infer these implicitly from the presented accuracy.
Figure 5: Convolutional Stack Architectures. The overall architectures of FCN and ResNet used in this paper follow the baseline models presented by Wang et al. (2017), which were designed specifically for Time Series Classification (TSC). Both architectures place convolutional layers before the pooling layer and FC layers after it for the classification decision. FCN consists of three convolutional layers, with hidden dimension sizes of 128, 256, and 256 and kernel sizes of 9, 5, and 3, in that order. ResNet comprises nine convolutional layers, with hidden dimension sizes of 64 $\times$ 3, 128 $\times$ 3, and 256 $\times$ 3 and kernel sizes of (9, 5, 3) $\times$ 3. SoM-TP is applied to the single pooling layer just before the FC layers in both architectures. The FC layers of the CLS network and DPLN consist of three linear layers each. The FC layers of the CLS network have the following shapes: (256 $\cdot$ number of segments) $\times$ 512, 512 $\times$ 1024, and 1024 $\times$ number of classes. The FC layers of the DPLN have the shapes (256 $\cdot$ number of segments $\cdot$ 3) $\times$ 512, 512 $\times$ 1024, and 1024 $\times$ number of classes.
...and 4 more figures
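The layer sizes listed in the Figure 5 caption can be written down as a small helper, which makes the difference between the CLS network and DPLN inputs explicit. This is only a bookkeeping sketch: the names `FCN_CONV`, `RESNET_CONV`, and `fc_layer_shapes` are hypothetical, and the segment and class counts in the example are arbitrary values chosen for illustration.

```python
from typing import List, Tuple

# (output channels, kernel size) per convolutional layer, as listed in the caption.
FCN_CONV: List[Tuple[int, int]] = [(128, 9), (256, 5), (256, 3)]
RESNET_CONV: List[Tuple[int, int]] = [
    (channels, kernel) for channels in (64, 128, 256) for kernel in (9, 5, 3)
]


def fc_layer_shapes(n_segments: int, n_classes: int,
                    n_poolings: int = 1, channels: int = 256) -> List[Tuple[int, int]]:
    """Return (in_features, out_features) for the three linear layers.

    n_poolings=1 gives the CLS network (one selected pooling output);
    n_poolings=3 gives the DPLN (all three pooling outputs together).
    """
    in_features = channels * n_segments * n_poolings
    return [(in_features, 512), (512, 1024), (1024, n_classes)]


# Example with arbitrary values: 4 segments, 5 classes.
cls_shapes = fc_layer_shapes(n_segments=4, n_classes=5)
dpln_shapes = fc_layer_shapes(n_segments=4, n_classes=5, n_poolings=3)
```

The final convolutional layer of both stacks has 256 channels, which is where the factor 256 in the first FC layer comes from. The DPLN input is three times wider because it receives all three pooling outputs rather than the selected one.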
