InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang

TL;DR

InfiGFusion fuses heterogeneous LLMs by moving beyond token-level logit alignment to structure-aware graph alignment of logit activations. It introduces Graph-on-Logits Distillation (GLD), a Gromov-Wasserstein (GW) based objective whose closed-form approximation reduces the cost from O(n^4) to O(n log n) with a provable error bound, which makes multi-source fusion practical. The training objective combines GLD with token-level distillation (ULD) and supervised fine-tuning (SFT) to distill diverse source models into a single pivot model, so inference cost stays that of one model. Results on 11 benchmarks show substantial improvements on complex reasoning tasks such as multi-step arithmetic and causal judgment, supporting the importance of structure-preserving alignment when fusing heterogeneous LLMs.

Abstract

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
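
The graph construction described in the abstract (keep the top-$k$ logits per position, then sum their outer products over the sequence) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `coactivation_graph` is hypothetical, raw logit values are kept for the retained entries (the source does not say whether logits are normalized first), and a dense $V \times V$ matrix is built, which is only workable for a toy vocabulary.

```python
import numpy as np


def coactivation_graph(logits: np.ndarray, k: int) -> np.ndarray:
    """Build a [V, V] co-activation graph from [L, V] logits.

    Assumption: entries outside the top-k of each position are set to zero
    and the retained entries keep their raw logit values.
    """
    seq_len, vocab = logits.shape
    if not 1 <= k <= vocab:
        raise ValueError("k must be between 1 and the vocabulary size")

    # Column indices of the k largest logits at each sequence position.
    top_idx = np.argpartition(logits, vocab - k, axis=1)[:, vocab - k:]

    # Zero everything except the top-k entries of each row.
    sparse = np.zeros_like(logits)
    rows = np.arange(seq_len)[:, None]
    sparse[rows, top_idx] = logits[rows, top_idx]

    # Summing the outer products z_t z_t^T over positions t equals Z^T Z.
    return sparse.T @ sparse


if __name__ == "__main__":
    toy_logits = np.array([[2.0, 1.0, 0.1],
                           [0.5, 3.0, 1.5]])  # L = 2 positions, V = 3 channels
    print(coactivation_graph(toy_logits, k=2))
```

Entry $(i, j)$ of the result is large when channels $i$ and $j$ are both strongly active at the same positions, which is the "joint activation" the edges are meant to quantify. Writing the sum of outer products as a single matrix product avoids an explicit loop over sequence positions.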

Paper Structure

This paper contains 55 sections, 5 theorems, 70 equations, 5 figures, 10 tables.

Key Result

Proposition 1

Let $C \in \mathbb{R}^{n \times n}$ and $D \in \mathbb{R}^{m \times m}$ be the similarity matrices derived from logit self-inner products and row-normalized such that $\sum_{j} C_{ij} = 1$ and $\sum_{l} D_{kl} = 1$. Then the absolute error between the exact and approximated GW distances satisfies:

Figures (5)

  • Figure 1: Token-level vs. Structure-aware Fusion. Given pivot and source logits of shape $[L,3]$ (sequence length $L$, vocab size $3$), token-level methods (left) align dimensions independently, ignoring token interactions. GLD (right) aggregates outer products into $[3,3]$ co-activation graphs, capturing semantic dependencies via structure-aware graph alignment.
  • Figure 2: InfiGFusion framework. Given instruction-response pairs, source and pivot models produce logits, sparsified into feature-level graphs capturing semantic dependencies. We align graphs via an efficient Gromov-Wasserstein approximation (GLD), reducing complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(n \log n)$. The overall objective combines structure-aware distillation (GLD) with token-level distillation (ULD) and supervised signals (SFT) for robust fusion.
  • Figure 3: Top-k analysis.
  • Figure 4: Case study.
  • Figure 5: Comparison of Wasserstein distance (WD) and Gromov-Wasserstein distance (GWD) distributions during fusion. Left: before distillation. Middle: after token-level WD optimization; WD drops significantly but GWD remains largely unchanged, indicating semantic dependency misalignment. Right: after GLD optimization.
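
The captions above refer to a sorting-based route from $\mathcal{O}(n^4)$ to $\mathcal{O}(n \log n)$. The paper's actual approximation is not reproduced in this summary, so the sketch below shows only a standard, related fact as an illustration: for two one-dimensional samples of equal size with uniform weights, the 1-Wasserstein distance has a closed form obtained by sorting both samples and averaging the absolute differences, and the sort dominates the cost. The function name `sorted_w1` is hypothetical, and how InfiGFusion reduces a graph to such scalar features is not specified here.

```python
import numpy as np


def sorted_w1(x, y) -> float:
    """1-Wasserstein distance between two equal-size 1-D samples.

    Assumption: both samples carry uniform weights, so the optimal
    transport plan matches the i-th smallest x to the i-th smallest y.
    """
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    if x_sorted.shape != y_sorted.shape or x_sorted.ndim != 1:
        raise ValueError("inputs must be 1-D samples of equal length")

    # Sorting costs O(n log n); the matching itself is then a linear pass.
    return float(np.mean(np.abs(x_sorted - y_sorted)))


if __name__ == "__main__":
    print(sorted_w1([3.0, 1.0, 2.0], [2.0, 4.0, 3.0]))
```

No transport plan is ever materialized, which is what makes sort-based closed forms attractive compared with solving a general optimal transport problem.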

Theorems & Definitions (9)

  • Proposition 1: Approximation Error Bound
  • Proposition 2: Lipschitz Constants Comparison
  • proof
  • Lemma 1: GW Lipschitz constant
  • proof
  • Lemma 2: KL loss Lipschitz constant
  • proof
  • Lemma 3: Lipschitz constant of the 1-Wasserstein loss
  • proof