On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices

Lianming Huang, Haibo Hu, Qiao Li, Nan Guan, Chun Jason Xue

TL;DR

This work tackles the challenge of deploying large vision-language models for multi-task perception on edge devices by introducing an on-demand, multi-task sparsity framework. It combines block-level weight splitting, overlap-aware sparsity alignment, and correlation-aware pre-loading to minimize task-switch I/O while preserving accuracy. The approach yields substantial improvements in task-switch latency and GPU memory efficiency, demonstrated on real-vehicle deployments and multi-task benchmarks with state-of-the-art accuracy–latency–sparsity trade-offs. This system-level sparsity design enables practical, scalable edge deployment of large multimodal models for safety-critical applications like autonomous driving.

Abstract

Sparsity is essential for deploying large models on resource-constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method mitigates the cold-start latency inherent in monolithic loading. Experiments on a real-world autonomous driving platform show that our framework accelerates task switching by over 6.6× on average compared to existing sparsity methods.
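The differential-loading idea in the abstract can be illustrated with plain set arithmetic. The sketch below is not the authors' implementation: the function name `switch_plan`, the task names, and the block indices are hypothetical, and each task's sparse model is assumed to be describable as a set of active block IDs.

```python
from typing import Dict, Set, Tuple


def switch_plan(resident: Set[int], needed: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Return (blocks_to_load, blocks_to_evict) for a task switch.

    Only blocks the next task needs and that are not already resident
    have to be read from storage; shared blocks are reused in place.
    """
    to_load = needed - resident
    to_evict = resident - needed
    return to_load, to_evict


# Hypothetical per-task active-block sets (illustrative values only).
task_blocks: Dict[str, Set[int]] = {
    "car": {0, 1, 2, 3, 5},
    "traffic_light": {0, 1, 2, 4, 5},
}

# Start with the "car" task resident, then switch to "traffic_light".
resident = set(task_blocks["car"])
to_load, to_evict = switch_plan(resident, task_blocks["traffic_light"])

# Apply the plan: drop unused blocks, add the newly required ones.
resident = (resident - to_evict) | to_load
```

In this toy case the two tasks share four of five blocks, so the switch touches one block in each direction instead of reloading all five. Aligning sparse structures across tasks, as the abstract describes, is what keeps `to_load` small.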

Paper Structure

This paper contains 18 sections, 9 equations, 5 figures, 2 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Task-switching frequency graph and sort, where arrow directions indicate switches between tasks and edge labels denote how often each transition occurs.
  • Figure 2: Overview of the proposed block-layer selective loading framework for multi-modal tasks. Inputs from the vision and text encoders are processed through a shared Transformer backbone, where task-specific layers are dynamically loaded from storage to GPU memory based on the selected cutting range.
  • Figure 3: Pairwise Jaccard similarity heatmaps for different task settings.
  • Figure 4: Execution order of perception tasks on our autonomous vehicle, with each image illustrating a key scene in the sequence (Car → Traffic Light → Car → Obstacle → Person).
  • Figure 5: Task-switching latency.
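Figure 3 reports pairwise Jaccard similarity between tasks. As a minimal sketch of how such a matrix could be computed, the code below assumes the similarity is taken over each task's set of active block IDs; that reading, along with the names `jaccard` and `pairwise_jaccard` and the example masks, is an assumption rather than something the listing above confirms. The Jaccard index itself is the standard definition, |A ∩ B| / |A ∪ B|.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def pairwise_jaccard(task_blocks: Dict[str, Set[int]]) -> Dict[Tuple[str, str], float]:
    """Similarity for every unordered task pair, keyed in sorted name order."""
    return {
        (s, t): jaccard(task_blocks[s], task_blocks[t])
        for s, t in combinations(sorted(task_blocks), 2)
    }


# Hypothetical active-block sets for three perception tasks.
masks = {
    "car": {0, 1, 2, 3},
    "person": {0, 1, 2, 4},
    "obstacle": {0, 5, 6, 7},
}

sims = pairwise_jaccard(masks)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

A high score for a pair means a switch between those two tasks needs few new blocks, so under this reading the heatmap doubles as a rough predictor of switching I/O.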