Enhancing User Experience in On-Device Machine Learning with Gated Compression Layers

Haiguang Li, Usama Pervaiz, Joseph Antognini, Michał Matuszak, Lawrence Au, Gilles Roux, Trausti Thormundsson

TL;DR

This work tackles the power and UX challenges of on-device ML by introducing Gated Compression (GC) layers, which selectively gate neuron activations and promote activation sparsity, enabling efficient always-on inference across CNN and Transformer models. By partitioning a network into sub-networks and stopping early on negative samples, GC layers reduce data movement and computation while preserving accuracy, with reported theoretical power-cost reductions of 158× to 30,000× (0.63% down to 0.003% of baseline power). The approach yields consistent gains in precision and recall across vision, speech, and ViT tasks, and points to UX benefits such as longer battery life and faster responsiveness. The work also discusses GC layer placement depth, theta parameters for early stopping, and the extension to transformers, and highlights hardware co-design and real-world testing as ways to validate the UX improvements.
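
As a quick sanity check on how the two ranges in the TL;DR relate, and assuming the percentages are simply the reciprocals of the reduction factors (an interpretation, not something the summary states explicitly):

$$
\frac{1}{158} \approx 0.0063 = 0.63\%, \qquad \frac{1}{30{,}000} \approx 0.000033 \approx 0.003\%.
$$

Under that assumption the largest reduction factor corresponds to the smallest fraction of baseline power, and vice versa.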

Abstract

On-device machine learning (ODML) enables powerful edge applications, but power consumption remains a key challenge for resource-constrained devices. To address this, developers often face a trade-off between model accuracy and power consumption, employing either computationally intensive models on high-power cores or pared-down models on low-power cores. Both approaches typically lead to a compromise in user experience (UX). This work focuses on the use of Gated Compression (GC) layers to enhance ODML model performance while conserving power and maximizing cost-efficiency, especially for always-on use cases. GC layers dynamically regulate data flow by selectively gating the activations of neurons within the network and filtering out non-essential inputs, which reduces power needs without compromising accuracy and enables more efficient execution on heterogeneous compute cores. These improvements enhance UX through prolonged battery life, improved device responsiveness, and greater user comfort. In this work, we have integrated GC layers into vision and speech domain models, including the transformer-based ViT model. Our experiments demonstrate theoretical power efficiency gains ranging from 158x to 30,000x for always-on scenarios. This substantial improvement empowers ODML applications with enhanced UX benefits.
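
The gating and early-stopping idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the class name `GatedCompressionLayer`, the linear-plus-sigmoid gate, the fixed per-feature `mask`, and the helper `run_gated_network` are all hypothetical choices made for illustration. Only the general behaviour (compress activations, stop early when the gate judges the input to be a negative) is taken from the text.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class GatedCompressionLayer:
    """Illustrative GC layer (assumed form): a compression mask plus a scalar gate."""
    mask: Sequence[float]          # hypothetical per-feature keep mask (0.0 drops a feature)
    gate_weights: Sequence[float]  # hypothetical weights of a linear gate
    gate_bias: float
    theta: float                   # early-stopping threshold on the gate probability

    def forward(self, activations: Sequence[float]) -> Optional[List[float]]:
        # Compression: zero out masked features, which increases activation sparsity.
        compressed = [a * m for a, m in zip(activations, self.mask)]
        # Gate: estimate the probability that this input is a positive sample.
        score = sum(w * c for w, c in zip(self.gate_weights, compressed)) + self.gate_bias
        if sigmoid(score) < self.theta:
            return None  # early stop: no data is passed on to later sub-networks
        return compressed


Stage = Tuple[Callable[[Sequence[float]], List[float]], GatedCompressionLayer]


def run_gated_network(x: Sequence[float], stages: Sequence[Stage]) -> Optional[List[float]]:
    """Run (sub_network, gc_layer) stages in order; return None on an early stop."""
    current: Optional[List[float]] = list(x)
    for sub_network, gc_layer in stages:
        current = gc_layer.forward(sub_network(current))
        if current is None:
            return None
    return current


# Usage: a single ReLU "sub-network" followed by one GC layer.
relu = lambda v: [max(0.0, a) for a in v]
gc = GatedCompressionLayer(mask=[1.0, 0.0, 1.0],
                           gate_weights=[1.0, 1.0, 1.0],
                           gate_bias=-1.0,
                           theta=0.5)
passed = run_gated_network([2.0, 5.0, 1.0], [(relu, gc)])   # gate opens
stopped = run_gated_network([0.1, 5.0, 0.2], [(relu, gc)])  # gate closes
print(passed, stopped)
```

In this sketch the second input is rejected at the gate, so any later (and presumably more expensive) sub-network would never execute. That is the mechanism the abstract credits for power savings in always-on scenarios, where most inputs are negatives.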

Paper Structure (19 sections, 2 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Two GC layers added to existing architectures to transform any network into an efficient Always-On Gated Neural Network.
  • Figure 2: Comparative Analysis of sub-model size (i.e., initial network before GC layer) relative to the placement depth of the GC layer for ImageNet and Speech Command experiments.
  • Figure 3: Learning rate schedules, depicting the Cosine Decay schedule for ImageNet and the Piecewise Constant Decay schedule for Speech Command.
  • Figure 4: Performance Impact of GC layer on the ImageNet dataset. This figure illustrates the precision, recall, gating performance, and activation sparsity at various depths of GC layer integration within the network architecture, indicating the GC layer's influence on the overall model performance for the person and dog detection tasks.
  • Figure 5: GC layer impact on the Speech Command dataset. Shown here are the performance metrics of precision, recall, gating performance, and activation sparsity, highlighting the effect of GC layer insertion at different depths of the network. For example, GCL@1 denotes the first GC layer, inserted at 10% depth of the baseline network, whereas GCL@3 denotes the third GC layer, inserted at 30% depth of the baseline network.
  • ...and 4 more figures