ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

TL;DR

The authors' analysis of ITO shows that multiple alignment drives discriminative power, while training-time fusion acts as a critical structural regularizer: it eliminates the modality gap and stabilizes training dynamics, preventing the early saturation often observed in aggressive contrastive learning.

Abstract

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
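To make the idea of aligning multiple augmented image-text pairs concrete, here is a minimal NumPy sketch. It assumes a standard symmetric InfoNCE (CLIP-style) objective averaged over every pairing of image views and text views; the abstract does not give ITO's exact loss, so the function names, the temperature of 0.07, and the averaging scheme are illustrative assumptions, and the training-time fusion module is omitted entirely.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project each row onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(img, txt, temperature=0.07):
    # Row i of img and row i of txt form the positive pair; all other rows are negatives.
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def multi_alignment_loss(img_views, txt_views, temperature=0.07):
    # Assumption: average the contrastive loss over every (image view, text view) pairing.
    losses = [symmetric_info_nce(i, t, temperature)
              for i in img_views for t in txt_views]
    return float(np.mean(losses))

# Synthetic stand-ins for encoder outputs: two noisy "views" per modality.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
img_views = [base + 0.05 * rng.normal(size=base.shape) for _ in range(2)]
txt_views = [base + 0.05 * rng.normal(size=base.shape) for _ in range(2)]

aligned = multi_alignment_loss(img_views, txt_views)
shuffled = multi_alignment_loss(img_views, [t[::-1] for t in txt_views])
print(aligned, shuffled)
```

With correctly paired rows the loss is small; reversing the text rows breaks every pairing and the loss rises, which is the signal the encoders are trained on.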

Paper Structure (37 sections, 9 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: Overview of the proposed ITO training framework. Starting from standard image–text contrastive pretraining, ITO restructures supervision through multimodal multiple alignment and introduces a lightweight multimodal fusion module during training. Multiple augmented image–text pairs derived from the same sample are used to enrich instance-level alignment, while training-time fusion enables structured cross-modal interaction and guides the encoders toward more integrated representations. Importantly, the fusion module is used only during training and is discarded at inference time, allowing ITO to retain a standard dual-encoder architecture for efficient deployment.
  • Figure 2: Linear image classification of ITO and its variants pretrained on CC3M.
  • Figure 3: Linear image classification of ITO and its variants pretrained on CC12M.
  • Figure 4: UMAP visualization. All models are trained on the CC3M dataset. For visualization, 8,192 image-text pairs are randomly sampled from CC12M. Blue points represent images, and red points represent texts. (a): CLIP exhibits a clear separation between modalities, with a distinct boundary between image and text embeddings. (b): FLAIR shows more compact text embeddings surrounded by image embeddings, likely due to its text-conditioned fusion mechanism. (c): Notably, ITO demonstrates a star-shaped distribution, where image and text embeddings are more closely clustered together, effectively dissolving the boundary between modalities.
  • Figure 5: UMAP visualization. All models are trained on the CC3M dataset. For visualization, 8,192 image-text pairs are randomly sampled from CC12M. Blue points represent images, and red points represent texts.
  • ...and 4 more figures
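
The modality separation visible in the UMAP captions above can also be summarized with a single number. The sketch below uses one common proxy, the Euclidean distance between the centroids of L2-normalized image and text embeddings. This is an assumption for illustration and not necessarily the metric used in the paper; the embeddings are synthetic, and `modality_gap` is a hypothetical helper name.

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    # Normalize rows to unit length, then measure the distance between modality centroids.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

rng = np.random.default_rng(0)
base = rng.normal(size=(512, 32))

# "Shared" space: both modalities scatter around the same points.
shared_img = base + 0.1 * rng.normal(size=base.shape)
shared_txt = base + 0.1 * rng.normal(size=base.shape)

# "Separated" space: each modality is shifted in an opposite direction.
offset = np.zeros(32)
offset[0] = 3.0
separated_img = base + offset
separated_txt = base - offset

gap_shared = modality_gap(shared_img, shared_txt)
gap_separated = modality_gap(separated_img, separated_txt)
print(gap_shared, gap_separated)
```

A value near zero means the two modalities occupy the same region of the embedding space, while a large value corresponds to the kind of clear boundary described for CLIP in Figure 4(a).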