Transport and Merge: Cross-Architecture Merging for Large Language Models
Chenhang Cui, Binyun Yang, Fei Shen, Yuxin Chen, Jingnan Zheng, Xiang Wang, An Zhang, Tat-Seng Chua
TL;DR
This work tackles cross-architecture knowledge transfer by aligning intermediate activations of heterogeneous LLMs with an optimal-transport formulation. By deriving cross-model feature and layer correspondences, it enables targeted weight-space fusion through selective neuron replacement, using only a small calibration set and optional residual-frozen adaptation. The approach yields consistent improvements across multiple low-resource languages and expert domains, and proves robust to source backbone choices while providing a principled representation-space interpretation of weight transport. It offers a practical alternative to distillation for scenarios where architectures differ, with implications for rapid, data-efficient adaptation to new languages and domains.
Abstract
Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted on low-resource data. This gap motivates mechanisms for transferring knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans then guide direct weight-space fusion, enabling effective high-resource-to-low-resource transfer using only a small set of calibration inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over the original target models.
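To make the pipeline in the abstract concrete, the following is a minimal sketch of activation-based OT alignment followed by weight transport between two layers of different widths. It is an illustration under stated assumptions, not the authors' implementation: the function names (sinkhorn, correlation_cost, transport_weights), the correlation-based cost, the uniform marginals, the entropic regularization strength, and the final linear blend with a hypothetical coefficient alpha are all choices made here for illustration. In particular, the blend stands in for the selective neuron replacement described above, which the sketch does not reproduce. The activations and weights are random placeholders.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=500):
    """Entropic OT between uniform marginals (assumed); returns the transport plan."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each source neuron
    b = np.full(m, 1.0 / m)   # mass on each target neuron
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def correlation_cost(acts_src, acts_tgt):
    """Cost 1 - Pearson correlation between neuron activation columns (assumed cost)."""
    zs = (acts_src - acts_src.mean(0)) / (acts_src.std(0) + 1e-8)
    zt = (acts_tgt - acts_tgt.mean(0)) / (acts_tgt.std(0) + 1e-8)
    corr = zs.T @ zt / acts_src.shape[0]
    return 1.0 - corr

def transport_weights(w_src, plan_out, plan_in):
    """Map a source weight matrix into the target layer's shape.

    Each plan is column-normalized so that every target neuron becomes a
    convex combination of source neurons (a barycentric projection).
    """
    t_out = plan_out / plan_out.sum(axis=0, keepdims=True)  # (d_src_out, d_tgt_out)
    t_in = plan_in / plan_in.sum(axis=0, keepdims=True)     # (d_src_in, d_tgt_in)
    return t_out.T @ w_src @ t_in                           # (d_tgt_out, d_tgt_in)

# Placeholder calibration activations: 64 inputs, layers of different widths.
rng = np.random.default_rng(0)
n_calib = 64
d_src_in, d_src_out = 12, 16   # source (larger) layer
d_tgt_in, d_tgt_out = 8, 10    # target (smaller) layer

acts_src_in = rng.normal(size=(n_calib, d_src_in))
acts_src_out = rng.normal(size=(n_calib, d_src_out))
acts_tgt_in = rng.normal(size=(n_calib, d_tgt_in))
acts_tgt_out = rng.normal(size=(n_calib, d_tgt_out))

w_src = rng.normal(size=(d_src_out, d_src_in))
w_tgt = rng.normal(size=(d_tgt_out, d_tgt_in))

# One transport plan per side of the layer.
plan_in = sinkhorn(correlation_cost(acts_src_in, acts_tgt_in))
plan_out = sinkhorn(correlation_cost(acts_src_out, acts_tgt_out))

# Transport the source weights, then blend (alpha is a hypothetical knob).
alpha = 0.5
w_transported = transport_weights(w_src, plan_out, plan_in)
w_merged = (1.0 - alpha) * w_tgt + alpha * w_transported
```

The sketch uses two plans per layer because a weight matrix has both an input and an output neuron axis, and in the cross-architecture setting neither axis is assumed to match between the two models. With real models, the placeholder arrays would be replaced by activations recorded on the same calibration inputs in both networks.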
