
Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices

Saeid Ghafouri, Mohsen Fayyaz, Xiangchen Li, Deepu John, Bo Ji, Dimitrios Nikolopoulos, Hans Vandierendonck

TL;DR

Polymorph tackles on-device, real-time multi-label video classification by exploiting three structural properties of video streams: label sparsity, temporal continuity, and label co-occurrence. It introduces context-aware LoRA adapters deployed on a shared backbone, with a two-stage process: training-time co-occurrence clustering to form compact label contexts, and inference-time greedy context detection to activate a minimal set of adapters that covers the active labels in each frame. By applying LoRA only to the final layers and using a parallel, composable forward pass without merging adapter weights into the base weights, Polymorph achieves substantial energy-efficiency and accuracy gains on the TAO benchmark (approximately $40\%$ energy reduction and $+9$ mAP points) while maintaining real-time latency on embedded hardware. This approach offers a scalable, flexible framework for edge video analytics, enabling efficient handling of large label spaces without full-model switching or duplication.

Abstract

Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.
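The adapter selection described above, choosing a small set of adapters whose label contexts together cover the active labels, can be illustrated with a standard greedy set-cover heuristic. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `greedy_context_cover`, the context names, and the label sets are all hypothetical, and the paper's actual context detection algorithm may differ in its details.

```python
from typing import Dict, List, Set


def greedy_context_cover(active_labels: Set[str],
                         contexts: Dict[str, Set[str]]) -> List[str]:
    """Pick contexts greedily until every coverable active label is covered.

    `contexts` maps a (hypothetical) context name to the labels its adapter
    specializes in. Ties are broken by context name so the result is stable.
    """
    uncovered = set(active_labels)
    selected: List[str] = []
    while uncovered:
        # Choose the context that covers the most still-uncovered labels.
        best = max(sorted(contexts),
                   key=lambda name: len(contexts[name] & uncovered))
        gain = contexts[best] & uncovered
        if not gain:
            # Remaining labels belong to no context; stop rather than loop.
            break
        selected.append(best)
        uncovered -= gain
    return selected


# Illustrative contexts (assumed, not taken from the paper).
example_contexts = {
    "street": {"car", "person", "bicycle"},
    "home": {"person", "dog", "couch"},
    "farm": {"dog", "horse"},
}
chosen = greedy_context_cover({"car", "person", "dog"}, example_contexts)
```

Greedy selection does not guarantee the smallest possible cover, but it is cheap to run per frame, which matters on a constrained device.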

Paper Structure

This paper contains 10 sections, 4 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 1: (a) A smaller model can surpass the accuracy of a larger variant on smaller subsets of classes (5 to 80, in increments of 5), comparing two ViT variants trained on the COCO dataset for multi-label classification. (b) ViT base and DeiT tiny, with 86.6 M and 5 M parameters respectively, and 2x lower energy consumption for the smaller model.
  • Figure 2: S-LoRA vs Polymorph architecture: (a) applying LoRA to all layers (baseline); (b) Polymorph applies LoRA only to the last 20% of the layers. Colored bars indicate LoRA adapters.
  • Figure 3: Comparison of adaptation strategies with three active classifiers. Polymorph applies LoRA adapters only to the final layers of a shared backbone, avoiding the overhead of the other methods.
  • Figure 4: Latency and power as the number of active LoRA adapters increases. Full-model LoRAs (blue and green) incur higher costs per additional adapter. Polymorph (orange and red), which adapts only the final layers, incurs minimal additional cost.
  • Figure 5: Overview of the Polymorph system architecture. (a) Training: context-specific LoRA adapters are trained based on label co-occurrence. (b) Inference: input passes through a shared backbone and is processed in parallel by the base and context-specific LoRA heads. The base output is compared with the previous frame to detect context changes. If a change is detected, the context detection algorithm updates the selected LoRA adapters. All active LoRA outputs are merged to generate the final results.
  • ...and 3 more figures
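
The parallel, unmerged LoRA forward pass mentioned in the captions for Figures 3 to 5 can be sketched for a single linear layer. This is an illustrative NumPy example under stated assumptions, not the Polymorph code: the function `lora_parallel_forward`, the shapes, the rank, and the `scale` parameter are hypothetical. It shows the general LoRA identity that computing the base output once and adding each adapter's low-rank update gives the same result as merging that adapter into the base weight, without ever modifying or duplicating the base weight.

```python
import numpy as np


def lora_parallel_forward(x, W, adapters, scale=1.0):
    """Run one shared base projection plus several LoRA branches.

    x:        (batch, d_in) input activations
    W:        (d_in, d_out) frozen base weight, never modified
    adapters: list of (A, B) pairs with A (d_in, r) and B (r, d_out)
    Returns the base output and one output per adapter.
    """
    base = x @ W  # computed once and shared by every adapter
    outputs = [base + scale * ((x @ A) @ B) for A, B in adapters]
    return base, outputs


rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2  # assumed sizes for illustration
x = rng.normal(size=(2, d_in))
W = rng.normal(size=(d_in, d_out))
adapters = [(rng.normal(size=(d_in, rank)), rng.normal(size=(rank, d_out)))
            for _ in range(3)]

base_out, adapter_outs = lora_parallel_forward(x, W, adapters)

# Sanity check: each branch matches the output of an explicitly merged weight.
merged_outs = [x @ (W + A @ B) for A, B in adapters]
```

Because the base projection is shared, each extra active adapter adds only two small matrix products of rank `rank`, rather than a full pass through a separate merged weight.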