OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang

TL;DR

OmniSIFT tackles the high computational cost of long multimodal token sequences in Omni-LLMs with a modality-asymmetric, two-stage token compression framework. It first prunes video tokens via Spatio-Temporal Video Pruning (STVP) to obtain visual anchors, then selects audio tokens conditioned on that visual context through the Vision-Guided Audio Selector (VGAS). Both stages are optimized end-to-end with a straight-through estimator. Across five audio-visual benchmarks, OmniSIFT outperforms the compared compression baselines, often matching or exceeding the full-token model while using only 25% of the original tokens and adding only 4.85M parameters. The method also reduces latency and memory use, which makes omni-modal reasoning more practical under constrained compute.
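The two-stage, asymmetric ordering described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`top_k_indices`, `compress`) are hypothetical, the video saliency scores are taken as given inputs, and scoring each audio token by its maximum dot product with a visual anchor is an assumption standing in for VGAS, whose actual scoring is not specified in this summary.

```python
def top_k_indices(scores, k):
    """Return the indices of the k largest scores, restored to original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress(video_tokens, video_scores, audio_tokens, keep_video, keep_audio):
    """Hypothetical two-stage selection: video first, then audio given video."""
    # Stage 1 (stand-in for STVP): keep the highest-scoring video tokens.
    anchor_idx = top_k_indices(video_scores, keep_video)
    anchors = [video_tokens[i] for i in anchor_idx]

    # Stage 2 (stand-in for VGAS): score audio tokens against the anchors.
    # ASSUMPTION: max dot-product similarity; the paper's scoring may differ.
    audio_scores = [max(dot(a, v) for v in anchors) for a in audio_tokens]
    audio_idx = top_k_indices(audio_scores, keep_audio)
    return anchor_idx, audio_idx


# Toy example with 2-dimensional token embeddings (made-up values).
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
saliency = [0.9, 0.1, 0.8, 0.2]
audio = [[0.0, 2.0], [3.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
kept_video, kept_audio = compress(video, saliency, audio, keep_video=2, keep_audio=2)
```

The point of the sketch is the dependency direction: the audio selection sees only the surviving visual anchors, so audio compression is conditioned on video but not the reverse.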

Abstract

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
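The abstract states that the framework is trained end-to-end with a straight-through estimator. As a rough illustration of that general technique, the sketch below uses a hard top-k mask in the forward pass and, in the backward pass, the gradient of a soft relaxation. The sigmoid relaxation and the function names (`ste_forward`, `ste_backward`) are assumptions for illustration; the paper's exact formulation is not given in this summary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ste_forward(logits, k):
    """Forward pass: hard 0/1 mask that keeps the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(logits))]


def ste_backward(logits, grad_mask):
    """Backward pass: pretend the mask was sigmoid(logits).

    ASSUMPTION: sigmoid is the soft surrogate. The hard top-k step has zero
    gradient almost everywhere, so the surrogate's derivative is used instead.
    """
    return [g * sigmoid(z) * (1.0 - sigmoid(z)) for z, g in zip(logits, grad_mask)]


logits = [2.0, -1.0, 0.0, 0.5]
mask = ste_forward(logits, k=2)                       # discrete selection
grads = ste_backward(logits, [1.0, 1.0, 1.0, 1.0])    # dense gradient signal
```

Note that dropped tokens still receive a nonzero gradient, which is what lets a selector learn to change its choices even though the selection itself is discrete.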

Paper Structure (31 sections, 8 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Performance comparison across five audio-video benchmarks. Results are obtained using Qwen2.5-Omni-7B with a 35% token retention ratio, comparing OmniSIFT against three baseline token compression methods and the full-token baseline.
  • Figure 2: Compression paradigm comparison for Omni-LLMs. Token compression for Omni-LLMs can be categorized into three paradigms: (a) modality-decoupled compression (top left), which applies audio and video compression independently; (b) modality-symmetric compression (top right), which treats the two modalities as equally informative; and (c) modality-asymmetric compression (bottom, ours), which first prunes visual redundancy and then performs visually guided audio compression.
  • Figure 3: Architecture of OmniSIFT, a modality-asymmetric compression framework. The framework operates in two stages. In the first stage, STVP removes spatial and temporal redundancy in video tokens to obtain a compact set of visual anchors. In the second stage, VGAS selects audio tokens conditioned on these visual anchors. The resulting compressed multimodal sequence is then fed into the LLM backbone for downstream reasoning.
  • Figure 4: Ablation results for video and audio compression ratios, evaluated on the Qwen2.5-Omni-7B model using the WorldSense benchmark. Left: Varying the video compression ratio $\rho_v$ with audio compression ratio $\rho_a=0.5$; Right: Varying the audio compression ratio $\rho_a$ with video compression ratio $\rho_v=0.8$.
  • Figure 5: Ablation results for OmniSIFT's architecture. w/o Spatial Component: all visual tokens are selected using temporal saliency only. w/o Temporal Component: all visual tokens are selected based on spatial saliency only. Audio-Only Selector: audio tokens are selected solely based on intra-audio self-attention without any visual guidance.
  • ...and 7 more figures
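
To relate the per-modality ratios in the Figure 4 caption to an overall token budget, the following sketch computes the retained fraction of a mixed sequence. It rests on two assumptions that this summary does not confirm: that $\rho_v$ and $\rho_a$ denote the fraction of tokens removed from each modality, and that the token counts used below (which are hypothetical) are representative. The function name `retained_fraction` is also hypothetical.

```python
def retained_fraction(n_video, n_audio, rho_v, rho_a):
    """Overall fraction of tokens kept after per-modality compression.

    ASSUMPTION: rho_v and rho_a are the fractions of video and audio tokens
    that are removed, so (1 - rho) of each modality survives.
    """
    kept = n_video * (1.0 - rho_v) + n_audio * (1.0 - rho_a)
    return kept / (n_video + n_audio)


# Hypothetical sequence: 900 video tokens and 100 audio tokens.
overall = retained_fraction(900, 100, rho_v=0.8, rho_a=0.5)
print(f"overall retained fraction: {overall:.2f}")
```

Under these assumptions, a video-heavy sequence is dominated by the video ratio, so changing $\rho_v$ moves the overall budget far more than changing $\rho_a$ does.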