MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning

Matteo Farina; Massimiliano Mancini; Elia Cunegatti; Gaowen Liu; Giovanni Iacca; Elisa Ricci

TL;DR

This work tackles task-agnostic pruning of vision-language models (VLMs): producing a single sparse subnetwork that transfers to unknown downstream tasks. It introduces Multimodal Flow Pruning (MULTIFLOW), a gradient-free approach that scores each parameter by combining its magnitude with the saliencies of the input and output neurons it connects, and that uses a multimodal prior to guide layer-wise sparsity. The method is evaluated on two VLMs (BLIP and XVLM) and three vision-language tasks (Image-Text Retrieval, Image Captioning, and Visual Question Answering) at 63% and 75% sparsity, with additional experiments at 90% sparsity. It outperforms eight baselines in the vast majority of cases and remains robust under extreme sparsity. The results indicate that preserving the emergent, multimodal information flow from pretraining enables transferable sparsity, which helps when deploying VLMs on memory-constrained devices and reduces pruning costs because the model is pruned only once.

Abstract

While excellent in transfer learning, Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first, gradient-free, pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated, combinatorial competitors in the vast majority of the cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow.
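The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the assumption that neuron saliencies are already available as plain lists, and the use of a global magnitude ranking to set per-layer budgets within one modality group are all illustrative choices made here. The sketch scores each edge as the product of its input-neuron saliency, its weight magnitude, and its output-neuron saliency, then keeps the top-scoring edges inside each layer up to that layer's budget.

```python
from typing import List

Matrix = List[List[float]]  # weights[r][l]: edge from input neuron l to output neuron r


def flow_scores(weights: Matrix, in_sal: List[float], out_sal: List[float]) -> Matrix:
    """Score each edge as in_sal[l] * |weights[r][l]| * out_sal[r] (assumed form)."""
    return [
        [out_sal[r] * abs(w) * in_sal[l] for l, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


def layer_budgets(layers: List[Matrix], sparsity: float) -> List[int]:
    """Per-layer keep counts from a global magnitude ranking over one modality group.

    Assumption: the layer-wise distribution is the one magnitude pruning would
    produce at the target sparsity. Ties at the threshold are all kept.
    """
    mags = sorted((abs(w) for m in layers for row in m for w in row), reverse=True)
    n_keep = round(len(mags) * (1.0 - sparsity))
    if n_keep == 0:
        return [0] * len(layers)
    threshold = mags[n_keep - 1]
    return [sum(abs(w) >= threshold for row in m for w in row) for m in layers]


def prune_layer(weights: Matrix, scores: Matrix, n_keep: int) -> Matrix:
    """Zero out every edge except the n_keep highest-scoring ones in this layer."""
    ranked = sorted(
        ((s, r, l) for r, row in enumerate(scores) for l, s in enumerate(row)),
        reverse=True,
    )
    keep = {(r, l) for _, r, l in ranked[:n_keep]}
    return [
        [w if (r, l) in keep else 0.0 for l, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


# Toy example with hypothetical values.
layer = [[1.0, -2.0], [0.5, 3.0]]
scores = flow_scores(layer, in_sal=[1.0, 0.5], out_sal=[2.0, 1.0])
pruned = prune_layer(layer, scores, n_keep=2)
budgets = layer_budgets([layer, [[0.1, 0.2], [0.3, 4.0]]], sparsity=0.5)
```

In this toy case the largest weight (3.0) is pruned because the neurons it connects have low saliency, whereas pure magnitude pruning would have kept it. In a real model the saliencies would have to be estimated from data, and the budgets would be computed separately for each modality.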

Paper Structure (24 sections, 8 equations, 4 figures, 11 tables)

Introduction
Related Work
Task-Agnostic Vision-Language Pruning
MULTIFLOW: Multimodal Flow Pruning
Modeling the Information Flow
Multimodality-aware compression
Experiments
Image-Text Retrieval (ITR)
Image Captioning (IC)
Visual Question Answering (VQA)
Additional Analyses
Extreme sparsity and different prunability
Ablations and sanity checks on MULTIFLOW
Conclusions
Additional Experiments
...and 9 more sections

Figures (4)

Figure 1: The conceptual difference between existing VLM pruning methods [shi2023upop, wang2022efficientvlm] and our proposed Task-Agnostic Vision-Language Pruning. While existing pruning methods use task-specific knowledge, hence requiring pruning the dense model from scratch for different tasks, we propose to shift the perspective and formalize TA-VLP, which only requires pruning once.
Figure 2: MULTIFLOW. Orange trapezoids represent groups of parameters processing different modalities. (i) To compute the information flow score for a parameter $\theta_{lr}$, MULTIFLOW combines the importance of the input neuron $l$ and that of the output neuron $r$, aggregating them via the local hop from $l$ to $r$ through $\theta_{lr}$. (ii) A global saliency score is obtained by computing (i) for all edges, and a global modality-aware distribution that exploits the emergent properties of large-scale pretraining guides layer-wise pruning.
Figure 3: Experiments at $90\%$ sparsity. ITR with XVLM (left), VQA with both BLIP and XVLM (center), and IC with both BLIP and XVLM (right). The random and dense baselines are also reported. All experiments follow the same configuration as those of Tabs. \ref{tab:itr} and \ref{tab:vqa-cap}.
Figure 4: Comparison of the sparsities obtained at each layer $\ell$ of each modality by (i) pruning with the $\mathtt{topk}$ global scores of MULTIFLOW (denoted by w/o distribution), (ii) OMP (w/o multimodality), and (iii) MULTIFLOW. The figure displays XVLM.
