Dataset-Driven Channel Masks in Transformers for Multivariate Time Series

Seunghan Lee, Taeyoung Park, Kibok Lee

TL;DR

The paper addresses the challenge of modeling channel dependencies in multivariate time series by proposing partial channel dependence (PCD), a dataset-aware refinement of the channel dependence (CD) captured by Transformer attention. It introduces channel masks (CMs), which combine a dataset-wide channel similarity matrix $\mathbf{R}$ with learnable domain parameters $\alpha$ and $\beta$ to produce a mask that modulates attention: $\mathbf{M} = \sigma(\alpha \cdot \bar{\mathbf{R}} + \beta)$ with $\bar{\mathbf{R}} = |\mathbf{R}| - \text{mean}(|\mathbf{R}|)$. CMs are multiplied element-wise into the attention matrix, so that the CD inferred from the dataset refines the CD learned by the model, and the framework includes a CD ratio $r(\mathbf{M})$ to quantify dataset-specific CD strength. Experiments with single-task models and a multi-task time series foundation model (UniTS) show consistent improvements in forecasting and related tasks across 13 datasets, with notable gains in few-shot and zero-shot settings; deployment remains efficient because $\mathbf{R}$ can be precomputed. The work emphasizes the practical importance of incorporating dataset-specific information into CD modeling and provides a reusable plug-in for Transformer-based TS models.

Abstract

Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. Capturing channel dependency (CD) is essential for modeling multivariate time series (TS), and attention-based methods have been widely employed for this purpose. Nonetheless, these methods primarily focus on modifying the architecture, often neglecting the importance of dataset-specific characteristics. In this work, we introduce the concept of partial channel dependence (PCD) to enhance CD modeling in Transformer-based models by leveraging dataset-specific information to refine the CD captured by the model. To achieve PCD, we propose channel masks (CMs), which are integrated into the attention matrices of Transformers via element-wise multiplication. CMs consist of two components: 1) a similarity matrix that captures relationships between the channels, and 2) dataset-specific and learnable domain parameters that refine the similarity matrix. We validate the effectiveness of PCD across diverse tasks and datasets with various backbones. Code is available at this repository: https://github.com/YonseiML/pcd.
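
To make the two components of a CM concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository above for that). It builds $\mathbf{M} = \sigma(\alpha \cdot \bar{\mathbf{R}} + \beta)$ from a full series and multiplies it element-wise into a channel-wise attention matrix. Several choices here are assumptions made for illustration: Pearson correlation as the similarity metric, the logistic sigmoid for $\sigma$, scalar $\alpha$ and $\beta$ held fixed rather than learned, and applying the mask after the softmax without renormalization. The function names `channel_mask` and `masked_channel_attention` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid (assumed form of sigma)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_mask(series, alpha=1.0, beta=0.0):
    """Build a (C, C) channel mask from a (T, C) series.

    Assumption: R is the Pearson correlation between channels.
    """
    R = np.corrcoef(series, rowvar=False)      # (C, C) similarity matrix
    R_abs = np.abs(R)                          # |R|
    R_bar = R_abs - R_abs.mean()               # mean normalization
    return sigmoid(alpha * R_bar + beta)       # M = sigma(alpha * R_bar + beta)


def masked_channel_attention(Q, K, M):
    """Channel-wise attention (C, C) modulated element-wise by the mask M.

    Assumption: the mask is applied after the softmax, with no renormalization.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A * M                                   # element-wise multiplication


# Toy data: channel 1 is a noisy copy of channel 0; channels 2 and 3 are independent.
rng = np.random.default_rng(0)
T, C, d = 512, 4, 8
series = rng.standard_normal((T, C))
series[:, 1] = series[:, 0] + 0.1 * rng.standard_normal(T)

M = channel_mask(series, alpha=2.0, beta=0.0)      # depends only on the dataset
Q = rng.standard_normal((C, d))                    # stand-ins for channel-token queries
K = rng.standard_normal((C, d))                    # stand-ins for channel-token keys
A_masked = masked_channel_attention(Q, K, M)
```

In this toy setup the mask entry for the similar pair (channels 0 and 1) is larger than for an unrelated pair, so attention between similar channels is attenuated less. Because `M` depends only on the dataset and the two domain parameters, the similarity matrix can be computed once and reused for every input window.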

Paper Structure

This paper contains 36 sections, 2 equations, 29 figures, 27 tables.

Figures (29)

  • Figure 1: CI vs. CD vs. PCD framework. Under the partial channel dependence (PCD) framework, the CD captured by the model is adjusted with a channel mask (i.e., the CD captured by the dataset).
  • Figure 2: Necessity of CD by dataset. (a) presents the CDs captured by the model, the dataset, and both, along with their distributions, where the CD by model is adjusted with the CD by dataset within the PCD framework. (b) shows TS forecasting results for (model) iTransformer, (dataset) iTransformer with its attention matrix replaced by the CM, and (both) iTransformer with the CM, highlighting the importance of leveraging the dataset itself.
  • Figure 3: Channel Mask. A CM consists of 1) a similarity matrix between channels and 2) domain parameters that refine the similarity matrix.
  • Figure 4: Necessity of domain parameters. Because a similarity metric is a relative measure that depends on the dataset, we employ domain parameters to adjust the similarity matrix. Specifically, we refine the matrix with 1) mean normalization and 2) domain parameters, resulting in $\mathbf{M} = \sigma(\alpha \cdot \bar{\mathbf{R}} + \beta)$.
  • Figure 5: Global & local dependencies. The CM and the attention matrix are complementary in capturing global and local dependencies, respectively, since the CM is constructed from the entire TS, while the attention matrix is computed from the input TS (a segment of the entire TS).
  • ...and 24 more figures
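
The global versus local distinction described for Figure 5 can be illustrated with a small sketch. It is a proxy only: instead of a real attention matrix, it compares channel similarity computed over an entire series with similarity computed over short input windows, assuming absolute Pearson correlation as the similarity metric. The synthetic data, the window length `L`, and the helper `abs_corr` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 2000, 96                      # full series length and input window length

# Channel 1 is independent of channel 0 in the first half of the series
# and closely follows channel 0 in the second half.
x0 = rng.standard_normal(T)
noise = rng.standard_normal(T)
x1 = np.where(np.arange(T) < T // 2, noise, x0 + 0.1 * noise)
series = np.stack([x0, x1], axis=1)  # shape (T, 2)


def abs_corr(x):
    """Absolute Pearson correlation between the columns of x."""
    return np.abs(np.corrcoef(x, rowvar=False))


R_global = abs_corr(series)          # computed once from the entire series
R_early = abs_corr(series[:L])       # window taken before the channels couple
R_late = abs_corr(series[-L:])       # window taken after the channels couple

print("global:", R_global[0, 1])
print("early window:", R_early[0, 1])
print("late window:", R_late[0, 1])
```

The global similarity sits between the two window-level values: it summarizes the relationship over the whole dataset, while any single window reflects only what is happening locally. This is the sense in which a dataset-level mask and an input-dependent attention matrix can carry complementary information.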