Scaled and Inter-token Relation Enhanced Transformer for Sample-restricted Residential NILM
Minhajur Rahman, Yasir Arafat
TL;DR
This work targets NILM with transformer models trained on small datasets, where standard self-attention tends to over-smooth attention scores and rely too heavily on intra-token relations. It introduces two mechanisms: inter-token relation enhancement, which removes diagonal entries from the similarity matrix to emphasize inter-token interactions, and dynamic temperature tuning, a learnable sharpness parameter controlled by a meta-network that adapts attention during training. Empirical results on the REDD dataset show consistent improvements over the original transformer and several state-of-the-art approaches, with 10–15% gains in F1 score across multiple appliances and only modest increases in computation. The approach offers a lightweight, training-efficient path to better NILM performance under data-constrained conditions, with future work aimed at larger datasets and further compute optimizations.
Abstract
Transformers have demonstrated exceptional performance across various domains due to their self-attention mechanism, which captures complex relationships in data. However, training on smaller datasets poses challenges: standard attention mechanisms can over-smooth attention scores and over-prioritize intra-token relationships, which weakens the capture of the inter-token dependencies that are critical for tasks like Non-Intrusive Load Monitoring (NILM). To address this, we propose a novel transformer architecture with two key innovations: inter-token relation enhancement and dynamic temperature tuning. The inter-token relation enhancement mechanism removes diagonal entries in the similarity matrix so that attention focuses on inter-token relations. The dynamic temperature tuning mechanism introduces a learnable temperature parameter that adapts attention sharpness during training, preventing over-smoothing and enhancing sensitivity to token relationships. We validate our method on the REDD dataset and show that it outperforms the original transformer and state-of-the-art models by 10–15% in F1 score across various appliance types, demonstrating its efficacy for training on smaller datasets.
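
The two mechanisms described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes that "removing diagonal entries" means masking the diagonal of the scaled similarity matrix to negative infinity before the softmax, and that the temperature is a positive scalar dividing the logits. The function name `inter_token_attention` and the `temperature` argument are hypothetical, and the meta-network that would produce the temperature during training is not modeled; the value is simply passed in.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def inter_token_attention(q, k, v, temperature=1.0):
    """Single-head attention with the diagonal masked out (illustrative sketch).

    q, k, v: arrays of shape (n_tokens, d).
    temperature: positive scalar; smaller values sharpen the attention.
        In the paper this is learned; here it is a plain argument (assumption).
    """
    n, d = q.shape
    if n < 2:
        raise ValueError("need at least two tokens once the diagonal is masked")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Standard scaled dot-product similarity matrix, shape (n, n).
    scores = q @ k.T / np.sqrt(d)

    # Assumed form of inter-token relation enhancement: each token is
    # prevented from attending to itself by masking the diagonal.
    scores = np.where(np.eye(n, dtype=bool), -np.inf, scores)

    # Assumed form of temperature tuning: rescale the logits before softmax.
    weights = softmax(scores / temperature, axis=-1)
    return weights @ v, weights
```

With this masking, every row of the attention matrix still sums to one, but all of the weight is distributed over the other tokens. Lowering `temperature` concentrates that weight on fewer tokens, which is the opposite of the over-smoothing the abstract describes.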
