Semantic Residual for Multimodal Unified Discrete Representation

Hai Huang, Shulei Wang, Yan Xia

TL;DR

The paper tackles the challenge of learning robust multimodal unified discrete representations by questioning the benefit of high-precision quantization for cross-modal tasks. It introduces SRCID, a semantic-residual framework that uses a two-layer mutual-information-based disentanglement with a shared codebook, leveraging CLUB for MI minimization and CPC for MI maximization, and employs EMA-based VQ replacement. Empirically, SRCID achieves state-of-the-art results on cross-modal generalization and zero-shot retrieval across AVE, AVVP, UCF-101, MSCOCO, and Clotho, with pretraining on a 40k-VGGsound dataset. The work demonstrates that semantic residuals, coupled with principled information bottlenecks, provide a scalable path to better cross-modal alignment and retrieval, while highlighting training stability considerations such as warm-start and EMA-based updates.
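To make the "MI maximization" side of this summary concrete, here is a minimal sketch of an InfoNCE-style contrastive loss, the objective commonly associated with CPC. It is an illustration under stated assumptions rather than the paper's implementation: the function name `info_nce`, the temperature value, and the toy audio/video features are hypothetical, and the paper's actual CPC setup may differ (for example, by predicting future time steps across modalities).

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired feature vectors.

    anchors[i] and positives[i] are assumed to describe the same sample in
    two modalities; every other row of `positives` acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Dot-product similarity between this anchor and every candidate.
        logits = [
            sum(a * p for a, p in zip(anchor, candidate)) / temperature
            for candidate in positives
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true pair.
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy features (hypothetical): three samples, one-hot for readability.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video_aligned = [row[:] for row in audio]
video_shuffled = video_aligned[1:] + video_aligned[:1]

loss_aligned = info_nce(audio, video_aligned)
loss_shuffled = info_nce(audio, video_shuffled)
```

Correctly paired features give a much lower loss than mismatched ones, and minimizing this loss raises a lower bound on the mutual information between the two modalities. CLUB plays the opposite role by supplying an upper bound to minimize; it is not sketched here.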

Abstract

Recent research on multimodal unified representations predominantly uses codebooks as the representation form, with Vector Quantization (VQ) as the quantizer, while other quantized representation forms remain underexplored. Our work explores more precise quantization methods and introduces a new framework, Semantic Residual Cross-modal Information Disentanglement (SRCID), inspired by the numerical residual concept inherent to Residual Vector Quantization (RVQ). SRCID employs semantic residual-based information disentanglement for multimodal data to better handle the inherent discrepancies between different modalities. Our method enhances the capabilities of unified multimodal representations and demonstrates exceptional performance in cross-modal generalization and cross-modal zero-shot retrieval. Its average results significantly surpass existing state-of-the-art models, as well as previous attempts with RVQ and Finite Scalar Quantization (FSQ) based on these models.
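The "numerical residual" idea behind RVQ can be sketched in a few lines: each layer quantizes whatever the previous layers failed to capture. This is plain RVQ on a single vector, not SRCID itself, which replaces the numerical residual with a semantic one. The function names, codebooks, and input values are hypothetical and chosen only for illustration.

```python
def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize `vector` layer by layer, passing the residual to the next layer."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        indices.append(idx)
        # Subtract the chosen code; the next layer only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-layer setup: a coarse codebook followed by a fine one.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]]
x = [1.2, 1.0]

indices, residual = rvq_encode(x, [coarse, fine])
reconstruction = rvq_decode(indices, [coarse, fine])
```

By construction, the reconstruction plus the final residual equals the input, so each added layer can only refine numerical precision. The abstract's point is that such added precision does not by itself help cross-modal alignment, which motivates a residual defined on semantics instead.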

Paper Structure (7 sections, 8 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: (a) RVQ (b) Simplified diagram of SRCID
  • Figure 2: SRCID Encoder Framework
  • Figure 3: Ablation of codebook size and CLUB