Dataset-Driven Channel Masks in Transformers for Multivariate Time Series
Seunghan Lee, Taeyoung Park, Kibok Lee
TL;DR
The paper addresses channel dependency (CD) modeling in multivariate time series by proposing partial channel dependence (PCD), which uses dataset-specific information to refine the CD captured by Transformer attention. It introduces Channel Masks (CMs), which combine a dataset-wide channel similarity matrix $\mathbf{R}$ with learnable domain parameters $\alpha$ and $\beta$ to produce a mask $\mathbf{M} = \sigma(\alpha \cdot \bar{\mathbf{R}} + \beta)$, where $\bar{\mathbf{R}} = |\mathbf{R}| - \text{mean}(|\mathbf{R}|)$. The mask is multiplied element-wise into the attention matrix, so the resulting CD reflects the dataset as well as the model, and a CD ratio $r(\mathbf{M})$ quantifies dataset-specific CD strength. Experiments on single-task models and a multi-task time series foundation model (UniTS) show consistent improvements in forecasting and related tasks across 13 datasets, with notable gains in few-shot and zero-shot settings. Deployment is efficient because $\mathbf{R}$ can be precomputed. The work argues that CD modeling should incorporate dataset-specific information and provides a reusable plug-in for Transformer-based TS models.
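To make the mask formula above concrete, here is a minimal NumPy sketch, not the authors' implementation. It rests on several assumptions: Pearson correlation stands in for the channel similarity matrix $\mathbf{R}$, $\sigma$ is taken to be the logistic sigmoid, the mean is taken over all entries of $|\mathbf{R}|$, and $\alpha$ and $\beta$ are fixed scalars rather than learned parameters. The function name `channel_mask` is hypothetical.

```python
import numpy as np

def channel_mask(X, alpha=1.0, beta=0.0):
    """Build a channel mask M from data X of shape (T, C), one column per channel.

    Assumptions (not confirmed by the source): R is the Pearson correlation
    between channels, sigma is the logistic sigmoid, and alpha/beta are
    fixed scalars here although the paper describes them as learnable.
    """
    R = np.corrcoef(X, rowvar=False)       # (C, C) channel similarity
    abs_R = np.abs(R)
    R_bar = abs_R - abs_R.mean()           # center |R| by its global mean
    return 1.0 / (1.0 + np.exp(-(alpha * R_bar + beta)))  # sigmoid

# Toy data: channels 0 and 1 are strongly related, channel 2 is pure noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 500)
X = np.stack(
    [np.sin(t),
     np.sin(t) + 0.1 * rng.standard_normal(500),
     rng.standard_normal(500)],
    axis=1,
)
M = channel_mask(X, alpha=5.0, beta=0.0)
```

In this toy setting the mask entry for the related pair (channels 0 and 1) is larger than the entry for the unrelated pair (channels 0 and 2). Because `R` depends only on the data, it could be computed once and cached, with only `alpha` and `beta` varying.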
Abstract
Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. Capturing channel dependency (CD) is essential for modeling multivariate TS, and attention-based methods have been widely employed for this purpose. Nonetheless, these methods primarily focus on modifying the architecture, often neglecting the importance of dataset-specific characteristics. In this work, we introduce the concept of partial channel dependence (PCD) to enhance CD modeling in Transformer-based models by leveraging dataset-specific information to refine the CD captured by the model. To achieve PCD, we propose channel masks (CMs), which are integrated into the attention matrices of Transformers via element-wise multiplication. CMs consist of two components: 1) a similarity matrix that captures relationships between the channels, and 2) dataset-specific and learnable domain parameters that refine the similarity matrix. We validate the effectiveness of PCD across diverse tasks and datasets with various backbones. Code is available at this repository: https://github.com/YonseiML/pcd.
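The element-wise integration of a channel mask into attention can be sketched as follows. This is an illustrative single-head example under stated assumptions, not the released code: each token represents one channel, the mask is applied after the softmax, and the masked weights are not renormalized. The abstract does not specify these details, and the helper name `masked_channel_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_channel_attention(Q, K, V, M):
    """Single-head attention over C channel tokens with a (C, C) mask M.

    Assumption: M multiplies the post-softmax attention weights
    element-wise, with no renormalization afterwards.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (C, C) attention between channels
    A_masked = A * M                    # element-wise channel mask
    return A_masked @ V, A_masked

rng = np.random.default_rng(0)
C, d = 4, 8
Q, K, V = (rng.standard_normal((C, d)) for _ in range(3))

# An all-ones mask leaves attention unchanged.
out_full, A_full = masked_channel_attention(Q, K, V, np.ones((C, C)))
# An identity mask lets each channel attend only to itself.
out_diag, A_diag = masked_channel_attention(Q, K, V, np.eye(C))
```

The two masks used here are the extremes: all ones keeps every cross-channel weight, while the identity removes all of them. A mask with entries strictly between 0 and 1 sits between these cases, which is the sense in which the dependence is partial.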
