Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão

TL;DR

Omni-C (Omni-Compress) is a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities through unimodal contrastive pretraining on large-scale unaligned data, advancing efficient and scalable multimodal learning.

Abstract

Recent multimodal systems often rely on separate expert modality encoders, which cause complexity and computational overhead to scale linearly with each added modality. While unified Omni-models address this via Mixture-of-Experts (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities--images, audio, and text--through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained systems via sequential modality processing and low-memory inference, eliminating the need for parallel expert loading or specialized hardware. Experiments show that Omni-C achieves performance comparable to expert models on unimodal and cross-modal tasks, with modest zero-shot degradation on audio and text that is largely recovered through lightweight linear probing or parameter-efficient fine-tuning. The unified architecture substantially reduces inference memory usage compared to multi-encoder baselines, advancing efficient and scalable multimodal learning.
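
As a rough illustration of what "unimodal contrastive pretraining" can look like, the sketch below computes a symmetric InfoNCE loss between two views of the same unimodal batch. The abstract does not state the exact objective, so the InfoNCE form, the `temperature` value, the function name `info_nce`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss between two views of one unimodal batch.

    z_a, z_b: (N, D) embeddings; row i of z_a and row i of z_b are assumed
    to come from the same sample (e.g. two augmentations of one image).
    """
    # L2-normalise so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N); the diagonal holds positives

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Toy check (synthetic data): matched views should score a lower loss
# than unrelated ones.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b_matched = view_a + 0.01 * rng.normal(size=(8, 16))
view_b_random = rng.normal(size=(8, 16))
loss_matched = info_nce(view_a, view_b_matched)
loss_random = info_nce(view_a, view_b_random)
```

Because every pair is drawn from a single modality, a loss of this kind needs no image-text or audio-text pairing, which is consistent with the abstract's claim that pretraining uses unaligned data.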

Paper Structure (14 sections, 1 equation, 7 figures, 15 tables)

This paper contains 14 sections, 1 equation, 7 figures, 15 tables.

Figures (7)

  • Figure 1: t-SNE visualization of image, audio, and text embeddings from the pretrained Omni-C model, showing clear separation of the image (red), audio (blue), and text (green) clusters. Embeddings are extracted from samples from ImageNet-1K (images), AudioSet (audio spectrograms), and English Wikipedia (text).
  • Figure 2: Unlike the multi-expert approach in (a) (e.g., AudioCLIP [guzhov2022audioclip]) and the Mixture-of-Experts (MoE) approach in (b) (e.g., Uni-MoE 2.0-Omni [li2025uni2]), which incur linear parameter scaling and routing overhead as modalities are added, Omni-C in (c) uses a single dense Transformer backbone with maximal parameter sharing to achieve competitive unimodal and cross-modal performance while drastically reducing system complexity and inference memory at deployment. The Omni-C model processes multiple heterogeneous modalities (images, audio spectrograms, and text). Images and audio spectrograms are divided into non-overlapping patches and projected via separate 2D convolutional embedding layers, while text sequences are tokenized and projected via a linear embedding layer. A shared learnable global CLS token is prepended to the sequence of embeddings (with modality-specific positional encodings added), and the full sequence is processed by the unified Omni-C Transformer backbone blocks. The final CLS token representation from the backbone is then fed into modality-specific MLP projection heads for unimodal contrastive pretraining.
  • Figure 3: Self-attention maps of the pretrained models from the last ViT-Base Transformer layer (12 heads), averaged over 3000 samples from downstream datasets. (a-c) show attention maps for the modality-specific expert models on images (KITTI), audio spectrograms (VGGSound), and text (AGNews), respectively, exhibiting focused attention patterns that specialize in modality-specific local features. (d-f) show the corresponding attention maps for the unified Omni-C model on the same inputs and datasets, revealing distributed attention that concurrently encodes and integrates information from heterogeneous inputs.
  • Figure 4: Self-attention maps from the last ViT-Base Transformer layer (12 heads), averaged over 3000 samples, after SBoRA fine-tuning on the downstream datasets. (a-c) show attention maps for the modality-specific expert models on images (KITTI), audio spectrograms (VGGSound), and text (AGNews), respectively. (d-f) show the corresponding attention maps for the unified Omni-C model on the same inputs and datasets. Importantly, the Omni-C backbone can effectively shift from its distributed attention (optimized for cross-modal generalization) to focused, modality-specific attention patterns through lightweight parameter-efficient fine-tuning (SBoRA).
  • Figure 5: SAIL-based [zhang2025assessing] alignment workflow. In stage 1, features are extracted from the image-text pairs. In stage 2, a linear probe is trained for modality alignment. The same workflow is applied to audio-text alignment.
  • ...and 2 more figures
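
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: all sizes (`D`, `PATCH`, `VOCAB`, `MAX_LEN`, the head width of 16) are hypothetical, the shared backbone is reduced to a single single-head attention block, the ReLU in the projection head is a guess, and positional encodings are added after the CLS token is prepended (the caption does not fix this order). The only substitution that is exact is the patch embedding: a 2D convolution whose stride equals its kernel size over non-overlapping patches is the same as a linear map applied to each flattened patch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32        # hidden width (hypothetical; the caption gives no sizes)
PATCH = 4     # patch side length (hypothetical)
VOCAB = 100   # toy vocabulary size (hypothetical)
MAX_LEN = 65  # CLS token plus up to 64 patches or tokens (hypothetical)

def init(*shape):
    return rng.normal(scale=0.02, size=shape)

def patchify(x, p):
    """Split an (H, W, C) array into non-overlapping p x p patches, flattened."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)  # (num_patches, p*p*c)

# Modality-specific parameters: input embeddings, positions, projection heads.
embed = {
    "image": init(PATCH * PATCH * 3, D),  # RGB image patches
    "audio": init(PATCH * PATCH * 1, D),  # single-channel spectrogram patches
    "text": init(VOCAB, D),               # token embedding table
}
pos = {m: init(MAX_LEN, D) for m in embed}
heads = {m: (init(D, D), init(D, 16)) for m in embed}

# Shared parameters: one CLS token and one backbone for every modality.
cls_token = init(1, D)
Wq, Wk, Wv, Wo = (init(D, D) for _ in range(4))

def backbone(x):
    """Stand-in for the shared Transformer: one single-head attention block."""
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(D)
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return x + attn @ (x @ Wv) @ Wo  # residual connection

def encode(x, modality):
    if modality == "text":
        tokens = embed["text"][x]  # x holds integer token ids
    else:
        tokens = patchify(x, PATCH) @ embed[modality]
    seq = np.concatenate([cls_token, tokens], axis=0)
    seq = seq + pos[modality][: len(seq)]
    cls_out = backbone(seq)[0]  # final CLS representation
    w1, w2 = heads[modality]
    return np.maximum(cls_out @ w1, 0.0) @ w2  # MLP projection head

# Modalities are encoded one at a time through the same backbone weights.
image_z = encode(rng.normal(size=(32, 32, 3)), "image")
audio_z = encode(rng.normal(size=(16, 64, 1)), "audio")
text_z = encode(rng.integers(0, VOCAB, size=20), "text")
```

Only `embed`, `pos`, and `heads` grow with the number of modalities; `cls_token` and the backbone weights are reused for every input, which is the parameter-sharing property the caption contrasts with the multi-expert and MoE designs.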