Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

Lin Chen; Xiaoke Zhao; Kun Ding; Weiwei Feng; Changtao Miao; Zili Wang; Wenxuan Guo; Ying Wang; Kaiyuan Zheng; Bo Zhang; Zhe Li; Shiming Xiang

TL;DR

The paper addresses the limitations of distilling multimodal LLMs with static next-token alignment by introducing Align-TI, a knowledge distillation (KD) framework built on token interactions. It decomposes knowledge transfer into two interaction types, vision-instruction token interactions and intra-response token interactions, and implements Instruction-aware Vision Alignment (IVA) and Transition Probability Alignment (TPA) to mimic the teacher's visual grounding and autoregressive generation dynamics, respectively, guided by the Instruction-Relevance Score (IRS). Empirical results show that Align-TI yields state-of-the-art performance among compact MLLMs of roughly 1B to 2B parameters and even outperforms larger baselines on several benchmarks, with a modest training overhead and improved inference efficiency. This work advances practical deployment of multimodal LLMs by enabling efficient, high-fidelity distillation through explicit token-interaction modeling.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment and neglect the dynamic token interactions that embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components. Instruction-aware Vision Alignment (IVA) enables the student model to imitate the teacher's ability to extract instruction-relevant visual information by aligning on salient visual regions. Transition Probability Alignment (TPA) captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves a $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
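
For context, the "static next-token alignment" that the abstract contrasts against is commonly implemented as a per-position KL divergence between the teacher's and the student's next-token distributions. The sketch below illustrates that baseline only, not Align-TI itself. The function names, the forward-KL direction, and the temperature parameter are assumptions made for illustration; the paper's exact Vanilla KD setup may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean forward KL(teacher || student) over response positions.

    Both arguments are lists of per-position logit lists, computed on the
    same (data-conditioned) prefix. This is an illustrative baseline, not
    the authors' implementation.
    """
    assert len(teacher_logits) == len(student_logits)
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # teacher distribution
        q = softmax(s_row, temperature)  # student distribution
        total += sum(
            pi * (math.log(pi) - math.log(qi))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )
    return total / len(teacher_logits)


# Toy example: two response positions, vocabulary of three tokens.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.5, 0.3]]
student = [[1.0, 0.7, -0.5], [0.2, 0.9, 0.4]]
loss_same = next_token_kd_loss(teacher, teacher)
loss_diff = next_token_kd_loss(teacher, student)
```

Each position is matched independently on a fixed prefix, which is why this objective carries no explicit signal about how tokens interact with one another.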

Paper Structure (38 sections, 14 equations, 17 figures, 18 tables)

Introduction
Preliminaries
Framework of Align-TI
Instruction-aware Vision Alignment
Transition Probability Alignment
Overall Objective of Align-TI
Experimental Results
Main Results
Ablation Study
Analysis on IVA and TPA
Scaling Analysis
Conclusion
Related Work
Implementation Details
Training Details
...and 23 more sections

Figures (17)

Figure 1: Experimental results overview. Left: Performance comparison between MLLMs distilled using our proposed Align-TI and other state-of-the-art MLLMs. Right: Performance gains achieved by Align-TI relative to the SFT and Vanilla KD baselines. (Details are provided in the appendix.)
Figure 2: Motivation of MLLM distillation in view of token interactions. Left: Vision-instruction token interaction analysis. Visualizations of instruction-to-vision attention weights demonstrate that different instructions activate distinct visual focus areas, while exhibiting significant token redundancy. Right: Intra-response token interaction analysis. The discrepancy between the data-conditioned prefix at training time and the self-conditioned prefix at test time amplifies the accumulated autoregressive error. (More details are provided in the appendix.)
Figure 3: Overview of the proposed Align-TI. The framework explicitly models MLLM KD from the perspective of token interactions.
Figure 4: Analysis of IRS across layers with Qwen2-7B-based [bai2025qwen2] and Vicuna-7B-based [zheng2023judging] MLLMs.
Figure 5: Ablation study on TPA design choices: comparing different sampling strategies and sampled token number $d$.
...and 12 more figures
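
The Figure 2 caption notes that instruction-to-vision attention concentrates on a few visual tokens, and the abstract says IVA aligns on salient visual regions. As a rough illustration of that idea, the sketch below scores each vision token by its mean attention from the instruction tokens, keeps the top-k tokens, and measures a feature mismatch on those tokens only. Everything here is an assumption for illustration: the mean-attention score is not the paper's Instruction-Relevance Score (whose definition is not reproduced on this page), and the MSE loss, the function names, and the toy data are hypothetical.

```python
def relevance_scores(attn):
    """Mean attention each vision token receives from the instruction tokens.

    attn[i][j] is the attention weight from instruction token i to vision
    token j. This simple average is a stand-in, not the paper's IRS.
    """
    n_instr = len(attn)
    n_vis = len(attn[0])
    return [sum(attn[i][j] for i in range(n_instr)) / n_instr for j in range(n_vis)]


def top_k_indices(scores, k):
    """Indices of the k highest-scoring vision tokens, in ascending order."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])


def salient_feature_mse(teacher_feats, student_feats, indices):
    """Mean squared error between teacher and student features on selected tokens."""
    total, count = 0.0, 0
    for j in indices:
        for t, s in zip(teacher_feats[j], student_feats[j]):
            total += (t - s) ** 2
            count += 1
    return total / count


# Toy example: 2 instruction tokens attending over 4 vision tokens.
attn = [
    [0.10, 0.60, 0.10, 0.20],
    [0.05, 0.70, 0.05, 0.20],
]
teacher_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
student_feats = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]

salient = top_k_indices(relevance_scores(attn), k=2)
loss = salient_feature_mse(teacher_feats, student_feats, salient)
```

Restricting the loss to the selected tokens reflects the redundancy the caption describes: most vision tokens receive little attention from the instruction and would mostly contribute noise to the alignment signal.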

Theorems & Definitions (2)

Definition 3.1: Instruction-Relevance Score
Definition 2.1: Excess Accumulation Error
