Text-centric Alignment for Multi-Modality Learning
Yun-Da Tsai, Ting-Yu Yen, Pei-Fu Guo, Zhe-Yan Li, Shou-De Lin
TL;DR
This work tackles modality mismatch in multimodal learning by proposing TAMML, a text-centered framework that converts all modalities to text and uses LLMs with in-context learning for text-style translation, modality summarization, and reasoning augmentation. By treating text as a unified semantic space, TAMML enables zero-shot generalization to unseen modality combinations and reduces the need to retrain when new modalities appear. Experiments on real datasets show that TAMML outperforms embedding-based baselines across various modality pairs and LLM configurations, with notable gains when larger models are used, and ablations show the contribution of each component. The approach offers a flexible, scalable path toward real-world multimodal systems where modality availability is dynamic and uncertain, with potential impact in fields from healthcare to finance.
Abstract
This paper addresses the challenge of modality mismatch in multimodal learning, where the modalities available at inference differ from those available during training. We propose Text-centric Alignment for Multi-Modality Learning (TAMML), an approach that uses Large Language Models (LLMs) with in-context learning, together with foundation models, to improve the generalizability of multimodal systems under these conditions. By leveraging the unique properties of text as a unified semantic space, TAMML demonstrates significant improvements in handling unseen, diverse, and unpredictable modality combinations. TAMML not only adapts to varying modalities but also maintains robust performance, showing the potential of foundation models to overcome the limitations of traditional fixed-modality frameworks built on embedding representations. This study contributes a flexible, effective solution for real-world applications where modality availability is dynamic and uncertain.
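The idea of text as a unified semantic space can be illustrated with a minimal Python sketch. It is not the TAMML implementation: the serializer functions, the `SERIALIZERS` registry, and the sample fields are all hypothetical names chosen for illustration, the image caption is assumed to come from some upstream captioning model, and the LLM steps (text-style translation, modality summarization, reasoning augmentation) are omitted. It only shows how a sample with any combination of modalities might be mapped to a single text input, so that a model trained on one combination can receive a different one at inference.

```python
from typing import Callable, Dict

# Hypothetical per-modality serializers; none of these names come from the paper.
def tabular_to_text(row: Dict[str, object]) -> str:
    """Serialize a table row as 'column is value' statements."""
    return "; ".join(f"{key} is {value}" for key, value in row.items())

def caption_to_text(caption: str) -> str:
    """Stand-in for an image captioning model: the caption is assumed given."""
    return f"The image shows {caption}"

def passthrough(text: str) -> str:
    """Plain text only needs whitespace cleanup."""
    return text.strip()

SERIALIZERS: Dict[str, Callable] = {
    "tabular": tabular_to_text,
    "image": caption_to_text,
    "text": passthrough,
}

def to_unified_text(sample: Dict[str, object]) -> str:
    """Map whichever modalities are present into one text string."""
    parts = []
    for name in sorted(sample):  # fixed order keeps the output deterministic
        if name not in SERIALIZERS:
            raise KeyError(f"no serializer for modality {name!r}")
        parts.append(f"[{name}] {SERIALIZERS[name](sample[name])}")
    return "\n".join(parts)

# Training might see only tabular data, while inference sees image plus text.
train_sample = {"tabular": {"type": "dog", "age": 3}}
test_sample = {"image": "a small brown dog on a sofa", "text": " Friendly and calm. "}

print(to_unified_text(train_sample))
print(to_unified_text(test_sample))
```

Because both samples end up as plain strings, the downstream model sees the same input type regardless of which modalities were available, which is the property the abstract relies on for handling unseen modality combinations.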
