Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration
Shwai He, Ang Li, Tianlong Chen
TL;DR
This work addresses the practicality of pruning Vision-Language Models (VLMs) by investigating how sparsity should be distributed across vision and language components and how to restore performance after pruning. It reveals that equal sparsity across modalities often yields near-optimal results, while pruning only the language branch offers efficiency with trade-offs, and introduces SparseLoRA, a sparsity-preserving finetuning method that applies masks to LoRA increments and enables knowledge distillation from the original dense model. The approach combines a calibration-based pruning step with two finetuning objectives (task recovery and KL-based distillation) to recover performance across unstructured and structured sparsity patterns, achieving substantial gains (e.g., about $+11.3\%$ at $2{:}4$ sparsity and $+47.6\%$ at $70\%$ unstructured sparsity) on multiple VLM tasks. The results demonstrate the universality and practicality of SparseLoRA for VLM sparsification, offering a scalable pathway toward efficient multimodal models in resource-constrained settings.
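To make the two finetuning objectives concrete, here is a minimal sketch in dependency-free Python of a task-recovery term and a KL-based distillation term for a single prediction. The direction KL(teacher || student), the choice to sum the two terms, and the `distill_weight` parameter are illustrative assumptions and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(student_logits, target_index):
    """Task-recovery term: negative log-likelihood of the ground-truth class."""
    return -log_softmax(student_logits)[target_index]

def kl_divergence(teacher_logits, student_logits):
    """Distillation term: KL(teacher || student); the direction is an assumption."""
    log_p = log_softmax(teacher_logits)  # dense (teacher) model
    log_q = log_softmax(student_logits)  # pruned (student) model
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def recovery_loss(teacher_logits, student_logits, target_index, distill_weight=1.0):
    """Hypothetical combination: task loss plus weighted distillation loss."""
    task = cross_entropy(student_logits, target_index)
    distill = kl_divergence(teacher_logits, student_logits)
    return task + distill_weight * distill

teacher = [2.0, 0.5, -1.0]  # logits from the original dense model
student = [1.2, 0.9, -0.3]  # logits from the pruned model
print(recovery_loss(teacher, student, target_index=0))
```

The distillation term is zero when the pruned model reproduces the dense model's output distribution exactly, so it penalizes only the drift introduced by pruning.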
Abstract
Vision-Language Models (VLMs) integrate information from multiple modalities and have shown remarkable success across various tasks. However, deploying large-scale VLMs in resource-constrained scenarios is challenging. Pruning followed by finetuning offers a potential solution but remains underexplored for VLMs. This study addresses two key questions: how to distribute sparsity across the modality-specific models, and how to restore the performance of pruned VLMs. Our preliminary studies identified two effective pruning settings: applying the same sparsity to both the vision and language models, and pruning only the language models. LoRA finetuning is a natural way to restore pruned models, but it is incompatible with sparsity: merging the dense LoRA update into the pruned weights destroys the sparsity pattern. To overcome this, we propose SparseLoRA, which applies sparsity directly to the LoRA weights. Our experimental results demonstrate significant improvements, including an 11.3% boost under 2:4 sparsity and a 47.6% enhancement under 70% unstructured sparsity. Code is released at: https://github.com/Shwai-He/VLM-Compression
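The incompatibility between LoRA and pruned weights can be illustrated with a small sketch in dependency-free Python. It builds a 2:4 mask, then compares merging a dense low-rank update with merging a masked one. The magnitude-based mask criterion, the matrix sizes, and all helper names are assumptions made for illustration. The paper itself uses a calibration-based pruning step, and this is not the released implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def hadamard(a, b):
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def n_m_mask(w, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m consecutive
    weights per row (magnitude is an illustrative criterion)."""
    mask = []
    for row in w:
        mask_row = [0.0] * len(row)
        for start in range(0, len(row), m):
            group = range(start, min(start + m, len(row)))
            for i in sorted(group, key=lambda i: abs(row[i]), reverse=True)[:n]:
                mask_row[i] = 1.0
        mask.append(mask_row)
    return mask

def zero_fraction(w):
    """Fraction of entries that are exactly zero."""
    values = [v for row in w for v in row]
    return sum(1 for v in values if v == 0.0) / len(values)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_out, d_in, rank = 4, 8, 2

W = rand_matrix(d_out, d_in)
mask = n_m_mask(W)            # 2:4 pattern: two of every four weights kept
W_sparse = hadamard(W, mask)  # pruned weights

# Random B and A stand in for LoRA factors after some training steps.
B = rand_matrix(d_out, rank)
A = rand_matrix(rank, d_in)
delta = matmul(B, A)          # dense low-rank increment

dense_merge = add(W_sparse, delta)                   # plain LoRA merge
masked_merge = add(W_sparse, hadamard(delta, mask))  # masked increment

print("pruned weights:", zero_fraction(W_sparse))
print("plain LoRA merge:", zero_fraction(dense_merge))
print("masked LoRA merge:", zero_fraction(masked_merge))
```

Because the low-rank product is dense, the plain merge writes nonzero values into pruned positions, whereas masking the increment with the pruning mask leaves every pruned position at zero.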
