INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models

Parsa Madinei, Ryan Solgi, Ziqi Wen, Jonathan Skaza, Miguel Eckstein, Ramtin Pedarsani

TL;DR

INTERLACE tackles the high compute cost of large vision-language models by identifying local redundancy in triplets of consecutive layers using cosine similarity and stabilizing fine-tuning with an interleaved freeze-anchor design. It selects $K = \lfloor \rho L \rfloor$ layers to drop by prioritizing high triplet similarity, freezes the triplet anchor layer, and finetunes only the remaining layer in each triplet, enabling rapid convergence with minimal data. On Qwen3-VL-Instruct 8B/4B models, INTERLACE achieves about $94\%$ of baseline performance with 10% pruning and around $86\%$ with 25% pruning, while delivering inference speedups of up to $1.18\times$ and outperforming several pruning baselines, including dense-finetuned variants. This approach enables deployment of high-capacity LVLMs in resource-constrained environments and provides a practical framework for structured architectural modification with constrained training.
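
The selection step summarized above can be illustrated with a minimal sketch. It assumes the layer budget is the floor of ρL and that one layer is dropped per selected triplet. The greedy, non-overlapping selection rule, the function name `select_triplets`, and the example scores are assumptions made for illustration; the summary does not specify how overlapping triplets are handled, and this is not the authors' implementation.

```python
import math


def select_triplets(triplet_scores, num_layers, rho):
    """Pick up to K non-overlapping triplets with the highest similarity.

    triplet_scores[i] is a precomputed (here: made-up) similarity score for
    the triplet of consecutive layers (i, i + 1, i + 2).
    K = floor(rho * num_layers) is the number of layers to drop; this sketch
    assumes one dropped layer per selected triplet.
    """
    k = math.floor(rho * num_layers)
    # Visit triplets from most to least similar (most redundant first).
    order = sorted(range(len(triplet_scores)),
                   key=lambda i: triplet_scores[i], reverse=True)
    used_layers = set()
    selected = []
    for start in order:
        if len(selected) == k:
            break
        layers = {start, start + 1, start + 2}
        # Assumption: a layer may belong to at most one selected triplet.
        if layers & used_layers:
            continue
        used_layers |= layers
        selected.append(start)
    return k, sorted(selected)


# Hypothetical 12-layer model with a 25% pruning ratio.
scores = [0.1, 0.9, 0.8, 0.2, 0.3, 0.95, 0.4, 0.5, 0.6, 0.7]
budget, starts = select_triplets(scores, num_layers=12, rho=0.25)
print(budget, starts)  # 3 [1, 5, 9]
```

In this toy run the triplet starting at layer 2 has the third-highest score but is skipped because it overlaps the already selected triplet starting at layer 1.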

Abstract

We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to a significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy: we remove the more redundant of the first two layers, finetune the remaining layer to compensate for the lost capacity, and freeze the third layer to serve as a stable anchor during finetuning. We find that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, achieving SOTA performance. Our code is available at: https://github.com/pmadinei/Interlace.git

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Interlace is a layer pruning framework for VLMs. It first identifies local redundancy by calculating cosine similarity for "triplets" of layers. In each selected triplet, the most redundant of the first two layers is dropped, the other is fine-tuned, and the third is frozen to act as a stable anchor. The performance comparison (top right) shows that Interlace outperforms alternative pruning methods by 28.4%.
  • Figure 2: Triplet selection and layer assignment based on cosine similarity scores in Qwen3-VL-8B. Selected triplets are highlighted with their layer assignments: of the first two layers, the one with the higher similarity score is dropped (red) and the other is fine-tuned (cyan), while the last layer is frozen (blue). Unselected triplets remain frozen. Individual layer similarity scores within selected triplets are normalized to fit within the triplet's overall similarity range.
  • Figure 3: Effect of fine-tuning on baseline performance with chain-of-thought (CoT) enabled. Benchmarks below the dashed line at 1.0 show performance degradation after fine-tuning without layer pruning.
  • Figure 4: Distribution of cosine similarity scores for individual layers and triplets across the depth of Qwen3-VL-8B (blue) and 4B (orange) models. Higher scores indicate greater redundancy.
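
The layer assignment described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `assign_roles`, the role labels, the tie-breaking rule (drop the first layer on equal scores), and the example scores are all assumptions.

```python
def assign_roles(layer_scores, selected_starts):
    """Label every layer as "drop", "finetune", or "freeze".

    layer_scores[i] is a per-layer similarity score (made-up values here).
    selected_starts lists the first layer index of each selected triplet.
    Layers outside selected triplets stay frozen, as the caption states.
    """
    roles = ["freeze"] * len(layer_scores)
    for start in selected_starts:
        first, second, anchor = start, start + 1, start + 2
        # The more redundant (higher-similarity) of the first two layers is
        # dropped. Assumption: ties are broken in favor of dropping `first`.
        if layer_scores[first] >= layer_scores[second]:
            drop, tune = first, second
        else:
            drop, tune = second, first
        roles[drop] = "drop"
        roles[tune] = "finetune"
        roles[anchor] = "freeze"  # third layer acts as the stable anchor
    return roles


# Hypothetical 6-layer example with one selected triplet (layers 1, 2, 3).
example_roles = assign_roles([0.2, 0.9, 0.7, 0.5, 0.3, 0.6], selected_starts=[1])
print(example_roles)
# ['freeze', 'drop', 'finetune', 'freeze', 'freeze', 'freeze']
```

Under this reading only the "finetune" layers would receive gradient updates after the "drop" layers are removed, which matches the interleaved finetune-freeze pattern the abstract describes.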