INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models
Parsa Madinei, Ryan Solgi, Ziqi Wen, Jonathan Skaza, Miguel Eckstein, Ramtin Pedarsani
TL;DR
INTERLACE tackles the high compute cost of large vision-language models by identifying local redundancy in triplets of consecutive layers using cosine similarity and stabilizing fine-tuning with an interleaved freeze-anchor design. It selects $K=\lfloor \rho L \rfloor$ layers to drop, where $L$ is the number of layers and $\rho$ is the pruning ratio, by prioritizing triplets with high similarity. It then freezes the anchor layer of each triplet and finetunes only the remaining layer, enabling rapid convergence with minimal data. On Qwen3-VL-Instruct 8B/4B models, INTERLACE achieves about $94\%$ of baseline performance with $10\%$ pruning and around $86\%$ with $25\%$ pruning, while delivering inference speedups of up to $1.18\times$ and outperforming several pruning baselines, including dense-finetuned variants. This approach enables deployment of high-capacity LVLMs in resource-constrained environments and provides a practical framework for structured architectural modification with constrained training.
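As a quick worked example of the layer budget $K=\lfloor \rho L \rfloor$, assume a hypothetical decoder with $L=36$ layers (this layer count is an illustrative assumption, not a figure stated here):

$$
\rho = 0.10:\quad K = \lfloor 0.10 \times 36 \rfloor = \lfloor 3.6 \rfloor = 3,
\qquad
\rho = 0.25:\quad K = \lfloor 0.25 \times 36 \rfloor = \lfloor 9 \rfloor = 9.
$$

Because of the floor, the realized pruning fraction $K/L$ can fall slightly below the nominal ratio $\rho$ (here $3/36 \approx 8.3\%$ for $\rho = 0.10$).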
Abstract
We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to a significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy: we remove the more redundant of the first two layers, finetune the remaining layer to compensate for the lost capacity, and freeze the third layer to serve as a stable anchor during finetuning. We find that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, reaching state-of-the-art performance. Our code is available at: https://github.com/pmadinei/Interlace.git
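The triplet procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `plan_triplets`), the exact triplet score (similarity between the hidden state entering a triplet and the state leaving it), the rule for choosing which of the first two layers to drop, and the greedy selection of disjoint triplets are all hypothetical choices made here for concreteness. Only the overall shape comes from the text: cosine similarity over consecutive-layer triplets, a budget of $K=\lfloor \rho L \rfloor$ dropped layers, and a drop / finetune / freeze role per triplet.

```python
import math
import numpy as np


def cosine(a, b):
    """Mean cosine similarity between matching token rows of a and b."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-12
    return float(np.mean(num / den))


def plan_triplets(hidden, rho):
    """Assign drop / finetune / freeze roles to layers.

    hidden[i] is the (tokens, dim) state entering layer i, and hidden[L]
    is the output of the last layer, so len(hidden) == L + 1.
    """
    L = len(hidden) - 1
    K = math.floor(rho * L)  # layer budget, K = floor(rho * L)

    # Assumed triplet score: similarity between the state entering layer i
    # and the state leaving layer i + 2.
    scores = [(cosine(hidden[i], hidden[i + 3]), i) for i in range(L - 2)]

    roles, dropped = {}, 0
    for _, i in sorted(scores, reverse=True):  # most redundant window first
        if dropped >= K:
            break
        if any(j in roles for j in (i, i + 1, i + 2)):
            continue  # keep chosen triplets disjoint (assumption)
        # Assumed rule: drop whichever of the first two layers
        # changes its input less.
        first = cosine(hidden[i], hidden[i + 1])
        second = cosine(hidden[i + 1], hidden[i + 2])
        drop, keep = (i, i + 1) if first >= second else (i + 1, i)
        roles[drop], roles[keep], roles[i + 2] = "drop", "finetune", "freeze"
        dropped += 1
    return K, roles


# Toy usage with synthetic hidden states for a hypothetical 8-layer model.
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(4, 16))]
for _ in range(8):
    hidden.append(hidden[-1] + 0.1 * rng.normal(size=(4, 16)))
K, roles = plan_triplets(hidden, rho=0.25)
print(K, dict(sorted(roles.items())))
```

In a real model the `hidden` list would come from a forward pass over calibration data, and the resulting roles would decide which layers are removed, which keep trainable parameters, and which have gradients disabled. Note that this greedy loop can stop short of `K` dropped layers if it runs out of disjoint triplets; how the actual method handles that case is not specified here.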
