Scaled and Inter-token Relation Enhanced Transformer for Sample-restricted Residential NILM

Minhajur Rahman, Yasir Arafat

TL;DR

This work targets NILM with transformer models trained on small datasets, where standard self-attention tends to over-smooth attention scores and rely too heavily on intra-token relations. It introduces two mechanisms: inter-token relation enhancement, which removes diagonal entries from the similarity matrix to emphasize inter-token interactions, and dynamic temperature tuning, a learnable sharpness parameter controlled by a meta-network to adapt attention during training. Empirical results on the REDD dataset show consistent improvements over the original transformer and several SOTA approaches, with 10–15% gains in F1 across multiple appliances and only modest increases in computation. The approach offers a lightweight, training-efficient path to better NILM performance under data-constrained conditions, with future work aimed at larger datasets and further compute optimizations.

Abstract

Transformers have demonstrated exceptional performance across various domains due to their self-attention mechanism, which captures complex relationships in data. However, training on smaller datasets poses challenges, as standard attention mechanisms can over-smooth attention scores and overly prioritize intra-token relationships, reducing the capture of meaningful inter-token dependencies critical for tasks like Non-Intrusive Load Monitoring (NILM). To address this, we propose a novel transformer architecture with two key innovations: inter-token relation enhancement and dynamic temperature tuning. The inter-token relation enhancement mechanism removes diagonal entries in the similarity matrix to improve attention focus on inter-token relations. The dynamic temperature tuning mechanism, a learnable parameter, adapts attention sharpness during training, preventing over-smoothing and enhancing sensitivity to token relationships. We validate our method on the REDD dataset and show that it outperforms the original transformer and state-of-the-art models by 10–15% in F1 score across various appliance types, demonstrating its efficacy for training on smaller datasets.
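The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: masking the diagonal with negative infinity before the softmax is one plausible way to "remove" diagonal entries, and the temperature `tau` is a fixed, hypothetical argument here, whereas the paper learns it during training. The function name `inter_token_attention` and the toy input `X` are likewise assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax; -inf entries receive zero weight."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def inter_token_attention(Q, K, V, tau=1.0):
    """Self-attention with the diagonal of the similarity matrix masked out.

    Q, K, V: lists of n >= 2 token vectors (lists of floats).
    tau: temperature (assumed fixed here; learned in the paper).
    Returns (outputs, attention_weights).
    """
    n = len(Q)
    outputs, weights = [], []
    for i in range(n):
        # Similarity of token i to every token j, divided by the temperature.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / tau for j in range(n)]
        # Assumption: drop the self-similarity entry by masking it before softmax.
        scores[i] = -math.inf
        a = softmax(scores)
        weights.append(a)
        # Weighted sum of value vectors using the inter-token weights only.
        outputs.append([sum(a[j] * V[j][d] for j in range(n))
                        for d in range(len(V[0]))])
    return outputs, weights

# Toy example: three 2-dimensional tokens used as queries, keys, and values.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out, A = inter_token_attention(X, X, X, tau=0.5)
_, A_sharp = inter_token_attention(X, X, X, tau=0.1)
```

In this sketch every token places zero weight on itself, so each output row is built purely from the other tokens, and a smaller `tau` concentrates each row's weight on its most similar neighbour.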

Paper Structure

This paper contains 15 sections, 2 theorems, 10 equations, 2 figures, 3 tables.

Key Result

Lemma 1

Impact of Diagonal Entries on Attention Distribution. Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be the attention matrix computed from the softmax-normalized similarity matrix $\mathbf{S}$. If $\mathbf{S}_{ii}$ has large positive values, the $\mathtt{softmax}$ operator $\mathtt{softmax}(\mathbf{S}

Figures (2)

  • Figure 1: Effect of the scaling factor $\sqrt{d_k}$ on the attention score distribution. Higher values of $d_k$ lead to smoother distributions, which reduce the model's ability to capture meaningful relationships. Here $x$ is a randomly sampled array, $x = \{0.1081, 0.4376, 0.7697, 0.1929, 0.3626, 2.8451\}$.
  • Figure 2: Overview of our proposed method. Given an aggregated load signal, we embed it into a higher-dimensional space with our embedding block (top left). We then pass it to transformer blocks based on our proposed inter-token relation enhanced self-attention mechanism (middle) and learnable dynamic temperature mechanism (bottom right). Finally, we produce target appliance signals with our reconstruction block (top right). Best viewed zoomed in.
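The smoothing effect described in the Figure 1 caption can be reproduced numerically with a short sketch. It assumes the figure applies a softmax to $x / \sqrt{d_k}$; the particular $d_k$ values below are hypothetical choices for illustration, and only the array `x` is taken from the caption.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Array quoted in the Figure 1 caption.
x = [0.1081, 0.4376, 0.7697, 0.1929, 0.3626, 2.8451]

results = {}
for d_k in (1, 4, 16, 64):  # hypothetical key dimensions
    p = softmax([v / math.sqrt(d_k) for v in x])
    results[d_k] = (max(p), entropy(p))
    print(f"d_k={d_k:3d}  max prob={max(p):.3f}  entropy={entropy(p):.3f}")
```

As $d_k$ grows, the largest probability shrinks and the entropy moves toward its upper bound $\log 6$, which is the flattening the caption refers to.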

Theorems & Definitions (6)

  • Definition 1
  • Lemma 1
  • Remark 1
  • Definition 2
  • Lemma 2
  • Remark 2