TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems

Khang H. N. Vo, Duc P. T. Nguyen, Thong Nguyen, Tho T. Quan

TL;DR

This work tackles the challenge of aligning text and image modalities by bridging semantic gaps with TI-JEPA, an energy-based joint embedding predictive architecture. By combining cross-attention between text and image representations with target/context blocks and a predictive objective, TI-JEPA learns robust multimodal encoders while mitigating energy collapse through selective freezing of pretrained components. Empirical results on multimodal sentiment analysis (MVSA-Single and MVSA-Multi) show state-of-the-art performance, with TI-JEPA-Large achieving up to 76.75% accuracy and 74.62% F1 on MVSA-Single and 77.55% accuracy and 75.02% F1 on MVSA-Multi, indicating strong cross-modal understanding and generalization. The approach highlights the potential of energy-based frameworks for broader multimodal fusion tasks and enables scalable pretraining with existing encoders, suggesting practical impact for vision-language systems and downstream applications such as VQA and cross-modal retrieval.

Abstract

This paper focuses on multimodal alignment in Artificial Intelligence, particularly between the text and image modalities. The semantic gap between the textual and visual modalities creates a discrepancy that limits the effectiveness of multimodal fusion. We therefore introduce the Text-Image Joint Embedding Predictive Architecture (TI-JEPA), an innovative pre-training strategy that leverages the energy-based model (EBM) framework to capture complex cross-modal relationships. TI-JEPA uses the flexibility of EBMs in self-supervised learning to model the compatibility between textual and visual elements. Through extensive experiments across multiple benchmarks, we demonstrate that TI-JEPA achieves state-of-the-art performance on the multimodal sentiment analysis task (and potentially on a wide range of multimodal tasks, such as Visual Question Answering), outperforming existing pre-training methodologies. Our findings highlight the potential of the energy-based framework for advancing multimodal fusion and suggest significant improvements for downstream applications.

Paper Structure

This paper contains 19 sections, 5 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: The proposed TI-JEPA architecture, where cross-attention between text and image encodings is leveraged to predict masked patches.
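
The idea in Figure 1 can be illustrated with a small, dependency-free sketch: context image patches attend to text tokens through cross-attention, a predictor estimates the embeddings of the masked target patches, and an energy scores how far the prediction is from the targets. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the absence of learned projections, the mean-pooling stand-in for the predictor, the random embeddings, and the mean absolute difference used as the energy.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, so keys and
    values are the same vectors. Each output row is a weighted average
    of the key/value vectors.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return fused


def energy(predicted, target):
    """Mean absolute difference between predicted and target embeddings.

    Assumption: the distance function is chosen for illustration only.
    """
    total = sum(
        abs(p - t)
        for pred_row, tgt_row in zip(predicted, target)
        for p, t in zip(pred_row, tgt_row)
    )
    return total / (len(predicted) * len(predicted[0]))


# Hypothetical toy embeddings standing in for encoder outputs.
rng = random.Random(0)
dim = 4
text_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
image_patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]

# Split the patches into a visible context block and a masked target block.
target_idx = [4, 5]
context = [p for i, p in enumerate(image_patches) if i not in target_idx]
target = [image_patches[i] for i in target_idx]

# Image context patches (queries) attend to text tokens (keys/values).
fused_context = cross_attention(context, text_tokens)

# Stand-in predictor: mean-pool the fused context, one copy per target patch.
pooled = [sum(row[j] for row in fused_context) / len(fused_context) for j in range(dim)]
predicted = [list(pooled) for _ in target]

loss = energy(predicted, target)
print(f"energy between predicted and target patches: {loss:.4f}")
```

In an actual training setup the energy would be minimized with respect to learned parameters; the sketch only evaluates it once to show how the pieces connect.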