
Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models

Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang

Abstract

Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task's demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78$\times$ to 2.01$\times$ inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at https://github.com/Rirayh/FlashU.

Paper Structure

This paper contains 34 sections, 8 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Task-specific FFN neuron analysis for generation and understanding. Each layer shows the fraction of neurons that are specific to generation, understanding, or shared between both tasks (excluding unimportant neurons). Percentages indicate the proportion of neurons in each layer ranked among the top 50% for task importance, as measured by OBD-inspired sensitivity scores $\Delta_i$.
  • Figure 2: Understanding Redundancy. Left: Intra-layer cosine similarity between the input and output features of each layer for newly generated tokens in understanding tasks. Right: Inter-layer heatmap of the cosine similarity among layer outputs for newly generated tokens. All measurements were conducted on samples from MME.
  • Figure 3: Illustration of the FlashU acceleration framework. (a) Task-Specific FFN Pruning calculates an importance score $I_j$ based on activation norms and weight magnitude to statically mask redundant neurons for a specific task. (b) Dynamic Token Pruning uses the V-Norm proxy at a shallow layer (the 2nd layer) to identify and prune spatially redundant visual tokens. (c) Dynamic Layer Skipping bypasses layers exhibiting high input-output cosine similarity ($S_i$), re-evaluating the skipping mask every $T_{LS}$ steps. (d) Diffusion Head Cache exploits temporal coherence in the generative process, computing the full diffusion head only at fixed intervals and reusing cached hidden states for intermediate steps.
  • Figure 4: Redundancy in Generation Tasks. Left: Intra-layer cosine similarity between the input and output of each layer. Right: Inter-layer heatmap of hidden state cosine similarity. All measurements were conducted on samples from GenEval.
  • Figure 5: The relation between the attention score and the norm of the value vector for each token. Tokens with the largest attention scores tend to have lower value norms.
  • ...and 7 more figures
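
The Dynamic Layer Skipping idea in the Figure 3(c) caption can be illustrated with a small, self-contained sketch. This is not the FlashU implementation: the function names, the `threshold` value, and the toy list-based "layers" are assumptions made for illustration, and `refresh_every` only plays the role of $T_{LS}$. The sketch marks a layer as skippable when its input and output are nearly parallel (high cosine similarity $S_i$) and re-evaluates that mask at a fixed interval.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_with_layer_skipping(layers, hidden, num_steps,
                            threshold=0.99, refresh_every=4):
    """Run `layers` for `num_steps` iterative steps with a skipping mask.

    Hypothetical parameters (not taken from the paper):
      threshold     -- similarity above which a layer is treated as redundant
      refresh_every -- interval at which the mask is re-evaluated (role of T_LS)
    """
    skip_mask = [False] * len(layers)
    executed_per_step = []
    for step in range(num_steps):
        refresh = step % refresh_every == 0
        executed = 0
        for i, layer in enumerate(layers):
            # Between refresh steps, bypass layers flagged as redundant.
            if not refresh and skip_mask[i]:
                continue
            out = layer(hidden)
            executed += 1
            if refresh:
                # Re-evaluate redundancy from input-output similarity S_i.
                skip_mask[i] = cosine_similarity(hidden, out) >= threshold
            hidden = out
        executed_per_step.append(executed)
    return hidden, skip_mask, executed_per_step


# Toy layers: a near-identity rescaling (redundant) and a 90-degree rotation.
def rescale(h):
    return [x * 1.0001 for x in h]


def rotate(h):
    return [h[1], -h[0]]
```

With the toy stack `[rescale, rotate, rescale]`, only the rotation changes the direction of the hidden state, so it is the only layer that keeps running between refresh steps. A real model would compute $S_i$ on hidden-state tensors and would likely use a different threshold and interval per task.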