ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

HanZpeng Liu; Yaqian Li; Zidan Wang; Shuoxi Zhang; Zonglin Zhao; Zihao Bo; Rinyoichi Takezoe; Kaiwen Long; Kun He

TL;DR

ITO pairs multimodal multiple alignment with a lightweight fusion module that is used only during training. The paper's analysis finds that multiple alignment drives discriminative power, while training-time fusion acts as a structural regularizer: it eliminates the modality gap and stabilizes training dynamics, preventing the early saturation often observed in aggressive contrastive learning.

Abstract

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
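
The abstract builds on CLIP-style image-text contrastive pretraining and refers to a "modality gap". As a rough illustration only, the sketch below shows the standard symmetric contrastive (InfoNCE) loss used by CLIP-style dual encoders, plus one common proxy for the modality gap: the distance between the centroids of the normalized image and text embeddings. This is not ITO's objective, which is not given in this summary; the function names, the centroid-distance proxy, and the temperature value of 0.07 are assumptions made for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(row, index):
    """Numerically stable log-softmax of `row`, evaluated at `index`."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_z


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; the k-th image and k-th text form a positive pair.

    Hypothetical helper for illustration, not the loss defined in the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine-similarity logits: rows index images, columns index texts.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax_at(logits[k], k) for k in range(n)) / n
    # Text-to-image: same computation over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax_at(cols[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def modality_gap(image_embs, text_embs):
    """Euclidean distance between image and text centroids (assumed proxy)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    dim = len(imgs[0])
    img_c = [sum(v[d] for v in imgs) / len(imgs) for d in range(dim)]
    txt_c = [sum(v[d] for v in txts) / len(txts) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_c, txt_c)))
```

In this toy setting, correctly paired embeddings give a loss near zero and mismatched pairs give a large loss. A centroid distance near zero would correspond to image and text embeddings occupying the same region of the space, which is the qualitative behavior the abstract attributes to training-time fusion.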

Paper Structure (37 sections, 9 equations, 9 figures, 14 tables)

Introduction
Related Work
Method
Preliminaries: CLIP-style Contrastive Pretraining
Multimodal Multiple Alignment
Training-Time Multimodal Fusion
Overall Objective and Inference
Experiments
Implementation Details
Zero-shot Image Classification
Linear Image Classification
Zero-shot Image--Text Retrieval
Transfer to MLLM Benchmarks
Ablation Study
Analysis
...and 22 more sections

Figures (9)

Figure 1: Overview of the proposed ITO training framework. Starting from standard image–text contrastive pretraining, ITO restructures supervision through multimodal multiple alignment and introduces a lightweight multimodal fusion module during training. Multiple augmented image–text pairs derived from the same sample are used to enrich instance-level alignment, while training-time fusion enables structured cross-modal interaction and guides the encoders toward more integrated representations. Importantly, the fusion module is used only during training and is discarded at inference time, allowing ITO to retain a standard dual-encoder architecture for efficient deployment.
Figure 2: Linear image classification of ITO and its variants pretrained on CC3M.
Figure 3: Linear image classification of ITO and its variants pretrained on CC12M.
Figure 4: UMAP visualization. All models are trained on the CC3M dataset. For visualization, 8,192 image-text pairs are randomly sampled from CC12M. Blue points represent images, and red points represent texts. (a): CLIP exhibits a clear separation between modalities, with a distinct boundary between image and text embeddings. (b): FLAIR shows more compact text embeddings surrounded by image embeddings, likely due to its text-conditioned fusion mechanism. (c): Notably, ITO demonstrates a star-shaped distribution, where image and text embeddings are more closely clustered together, effectively dissolving the boundary between modalities.
Figure 5: UMAP visualization. All models are trained on the CC3M dataset. For visualization, 8,192 image-text pairs are randomly sampled from CC12M. Blue points represent images, and red points represent texts.
...and 4 more figures
